slow script trying to find unique values in list - python

I've got a problem in Python:
I want to find how many UNIQUE a**b values exist if:
2 ≤ a ≤ 100 and 2 ≤ b ≤ 100?
I wrote the following script, but it's too slow on my laptop (and doesn't even produce a result):
List=[]
a = 2
b = 2
c = pow(a, b)
while b != 101:
    while a != 101:
        if List.count(c) == 0:
            List.append(c)
            a += 1
    b += 1
print len(List)
Is it good? Why is it slow?

This code doesn't work; it's an infinite loop, because a is only incremented when a new value is appended, not on every pass through the inner loop. After you fix that, you still won't get the right answer, because you never reset a to 2 after the inner loop finishes and the outer loop moves on to the next b.
Then, List will only ever contain 4, because you set c outside the loop to 2 ** 2 and never recompute it inside the loop. And when you fix that, it'll still be slower than it needs to be, because you read the entire list each time through to get the count, and as it gets longer, that takes more and more time.
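For concreteness, here is the loop with those fixes applied (still slow because of count, but it terminates with the correct answer):
List = []
b = 2
while b != 101:
    a = 2
    while a != 101:
        c = pow(a, b)
        if List.count(c) == 0:
            List.append(c)
        a += 1
    b += 1
print(len(List))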
You generally should use in rather than count if you just need to know whether an item is in a list, since it will stop as soon as it finds the item, but in this specific instance you should be using a set anyway, since you are looking for unique values. You can just add to the set without checking whether the item is already in it.
Finally, using for loops is more readable than using while loops.
result = set()
for a in xrange(2, 101):
    for b in xrange(2, 101):
        result.add(a ** b)
print len(result)
This takes less than a second on my machine.

The reason your script is slow and doesn't return a value is that you have created an infinite loop. You need to dedent the a += 1 line by one level; otherwise, after the first time through the inner while loop, a will not get incremented again.
There are some additional issues with the script that have been pointed out in the comments, but this is what is responsible for the issues you are experiencing.

Your code is not good, since it does not produce correct results. As the comment by @grael pointed out, you do not recalculate the value of c inside the loop, so you are counting only one value over and over again. There are other problems too, as other people have noted.
Your code is not fast for several reasons.
You are using a brute-force method. The answer can be found more simply by using number theory and combinatorics. Look at the prime factorization of each number between 2 and 100 and consider the prime factorization of each power of that number. You never need to calculate the complete number--the prime factorization is enough. I'll leave the details to you but this would be much faster.
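If you want a concrete starting point, here is a rough sketch of that idea; it never computes a**b itself, only exponent tuples:
def factorize(n):
    # prime factorization of n as a tuple of (prime, exponent) pairs
    factors = []
    d = 2
    while d * d <= n:
        e = 0
        while n % d == 0:
            n //= d
            e += 1
        if e:
            factors.append((d, e))
        d += 1
    if n > 1:
        factors.append((n, 1))
    return tuple(factors)

seen = set()
for a in range(2, 101):
    base = factorize(a)
    for b in range(2, 101):
        # a ** b has the same primes, with every exponent multiplied by b
        seen.add(tuple((p, e * b) for p, e in base))
print(len(seen))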
You are rolling your own loop, but it is faster to use Python's. Loop over a and b with:
for a in range(2,101):
    for b in range(2,101):
        c = pow(a, b)
        # other code here
This code uses the built-in capabilities of the language and should be faster. This also avoids your errors since it is simpler.
You use a very slow method to see if a number has already been calculated. Your if List.count(c) == 0 must check every previous number to see if the current number has been seen. This will become very slow when you have already seen thousands of numbers. It is much faster to keep the already-seen numbers in a set rather than a list. Checking if a number is in a set is much faster than using count() on a list.
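A tiny illustration of the difference, with made-up sizes:
seen_list = list(range(1000000))
seen_set = set(seen_list)
# both lines ask the same question, but the list is scanned element by
# element while the set does a single hash lookup on average
print(999999 in seen_list)  # slow: walks the whole list
print(999999 in seen_set)   # fast: one hash probe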
Try combining all these suggestions. As another answer shows, just using the last two probably suffices.

Related

How to write the maxint without using the library?

I am not sure what maxint is or what it does, but I am trying to solve this problem without using it, possibly by writing another function to do this.
I would like to write this function without using:
from sys import maxint
Here is what I am trying to implement without using maxint.
def maxSubArraySum(a,size):
    maxi = -maxint - 1
    maxi_ends = 0
    for i in range(0, size):
        maxi_ends = maxi_ends + a[i]
        if (maxi < maxi_ends):
            maxi = maxi_ends
        if maxi_ends < 0:
            maxi_ends = 0
    return maxi
This is a common pattern in programming (see this answer): to find the maximum number in a list, initialize ans to a value that's guaranteed to be lower than anything else in the list, then go through the list and simply perform the operation
if element > ans:
    ans = element
Alternatively, you could set ans to the value of the first element, but this is less elegant, since you need to check for the existence of that first element (otherwise, if the list is empty, you will get an error). Or the "list" might not be an actual list but rather an iterator, like in your example, so getting the first value would be messy.
Anyway, that's the point of initializing your variable to an extreme value. MaxInt, in other languages like Java and C, is very useful as this extreme value, because the values you encounter along the way literally cannot be bigger than MaxInt.
But note that, as I said in the very beginning, the point of infinity/maxint/minint is only to be larger/smaller than anything you could possibly encounter. So, for problems like the one you posted, you can usually easily make yourself a lower-bound for the smallest possible value of any maxi_ends. And you can make it a really loose lower-bound too (it doesn't have to be realistically attainable). So here, one way to find the lowest possible value of maxi_ends would be the sum of all the negative values in a. maxi_ends can't possibly go any lower than that.
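A minimal sketch of that bound, assuming a is a plain list of numbers:
def safe_initial_value(a):
    # no subarray sum can be smaller than the sum of all negative elements,
    # so one less than that is strictly below anything we can encounter
    return sum(x for x in a if x < 0) - 1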
Even easier, if you know the smallest possible value of a[i] and the maximum possible length of a, you can calculate the smallest possible value that maxi_ends could take for any input, right off the bat. This is useful in the kinds of abstract programming problems that you see in coding competitions, coding interview prep, and homework, especially since you just need a quick and dirty solution. For example, if a[i] can't be any less than -100, and len(a) is at most 1000, then maxi_ends can never go below -100 * 1000 = -100000, so you can just write maxi = -100001.
It's also a useful Python trick in some real-life situations (again, where you only need a quick hack) to pick a preposterously large number if you are lazy. For example, if you're estimating the shortest path, in miles, between two buildings in a road network, you could just pick 1000000000000 as an upper bound, since you know no path will ever be that long.
I don't know why you don't want to use sys.maxint (although, as the comments say, it's a good idea not to use it), but there are actual ways to represent infinity in Python: in Python 3.5 or above, you can do
import math
test = math.inf
and otherwise, you can use float('inf').
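Putting that together with your function, here is a sketch using -math.inf as the initial value (Python 3.5+; on older versions substitute -float('inf')):
import math

def maxSubArraySum(a):
    maxi = -math.inf  # strictly smaller than any real subarray sum
    maxi_ends = 0
    for x in a:
        maxi_ends += x
        if maxi < maxi_ends:
            maxi = maxi_ends
        if maxi_ends < 0:
            maxi_ends = 0
    return maxi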

python: multiplication in for loop skipped on second iteration

I am trying to implement the Sieve of Euler (as described on programmingpraxis.com). I have code that runs fine on the first iteration, but on the next iteration my multiplication is skipped for some reason that escapes me (probably just missing some Python behavior that is common sense to a more experienced programmer).
I am running this:
import numpy as np

# set up parameters
primes = [2]
startval = 2
stopval = 10000
firstrun = True
candidates = np.arange(start=startval,stop=stopval+1,step=1)

# actual program
for idx in range(int(np.ceil(np.sqrt(stopval)))): # only up until sqrt(n) needs checking
    print('pos1')
    print(candidates)
    print(candidates[0])
    times = candidates[0]*candidates
    print(times)
    diffset = list(set(candidates)^set(times))
    if len(diffset) is not 0: # to make sure the program quits properly if diffset=[]
        primes.append(diffset.pop(0))
        print('pos2')
        print(diffset)
        candidates = diffset
        print('pos3')
        print(candidates)
    else:
        break
print(primes)
The various print statements are just so I can get a grasp on what's going on. Note that the first outputs are fine; the interesting part starts the second time pos1 is printed. My candidates are updated just as I want them to be, and the new first element is also correct. So my question is:
Why is times = candidates[0]*candidates apparently skipped on the second iteration?
Please note: I am not asking for a "scratch your code, copy this working, faster, better, nicer code" answer. There are plenty of Python implementations out there; I want to do this myself. I think I am missing a fairly important concept of Python here, and that's why my code doesn't behave.
(Should anyone ask: no, this is not a homework assignment. I am using a bit of Python at my workplace and like to do these sorts of things at home to get better at coding.)
I just ran your code. Looking at the printed value of times, you can see that after the first iteration the operation is still performed, just not in the way you intended: the list times is simply the list candidates concatenated with itself three times. To elaborate:
1st iteration
candidates = np.arange(start=startval,stop=stopval+1,step=1)
so candidates is a numpy array. Doing
candidates*candidates[0]
is the same as candidates*2, which is "numpy_array*number", which is just element-wise multiplication.
Now further down you do
diffset = list(set(candidates) ^ set(times))
....
candidates = diffset
which sets up:
2nd iteration
candidates is now a list (see above). Doing
candidates*candidates[0]
is just candidates*3 which is now "list*number" which in python is not "multiply each list element by number", but instead: "create new list as being original list concatenated number times with itself". This is why you don't see the output you expected.
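A quick side-by-side of the two behaviors, with illustrative values:
import numpy as np

arr = np.array([2, 3, 5])
print(arr * 3)  # element-wise: [ 6  9 15]

lst = [2, 3, 5]
print(lst * 3)  # concatenation: [2, 3, 5, 2, 3, 5, 2, 3, 5]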
To fix this, simply do:
candidates = np.array(diffset)

Prime numbers with python [duplicate]

I'm fairly new to programming and I decided to do some exercises to improve my abilities. I'm stuck with an exercise: "Find the sum of all the primes below two million." My code is too slow.
Initially, I tried to solve it as a normal prime problem and ended up with this:
sum = 2 + 3
for i in range (5, 2000000, 2):
    for j in range (3, i, 2):
        if i%j == 0:
            break
    else:
        sum += i
print(sum)
In this way, all even numbers are excluded from the loop. But it did not solve my problem; the magnitude here is really big.
So I tried to understand what was happening with this code. I have a loop inside a loop, and the inner loop runs roughly as many times as the outer loop's index (not exactly, because the range doesn't start from 0), right? So when I try to find the primes under 20, the outer loop runs 8 times, but the inner loop runs about 60 times (I don't know if this math is correct; as I said, I'm quite new to programming). But with 2,000,000, I'm running the inner loop something like 999,993,000,012 times in total, and that is madness.
My friend told me about the Sieve of Eratosthenes, and I tried to write a new version:
list = [2]
list.extend(range(3, 2000000, 2))
for i in list:
    for j in list:
        if j%i == 0 and j > i:
            list.remove(j)
print(sum(list))
And that's what I achieved trying to simulate the sieve (ignoring even numbers helped). It's a lot faster (with the other code it would take a long time to find the primes under 200,000; with this new one I can), but it is still not enough to handle 2,000,000 in a reasonable time. The code has been running in the background since I started to write, and still nothing. I don't know how many times this thing is looping, and I am too tired to think about it now.
I came here to ask for help. Why is it so slow? What should I learn/read/do to improve my code? Is there any other method more efficient than this sieve? Thank you for your time.
Because list.remove is an O(n) operation, and you're doing it a lot. And you're not performing a true sieve, just trial division in disguise; you're still doing all the remainder testing you did in the original code.
A Sieve of Eratosthenes is typically implemented with an array of flags; in the simplest form, each index corresponds to the number itself, and the value is initially True for every index but 0 and 1. You iterate along, and when you find a True value, you set all indices that are multiples of it to False. This means the work is sequential addition, not multiplication and not division (which are much more expensive).
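A minimal sketch of that flag-array sieve, summing the primes below two million:
def sum_primes_below(n):
    is_prime = [True] * n
    is_prime[0] = is_prime[1] = False
    for i in range(2, int(n ** 0.5) + 1):
        if is_prime[i]:
            # smaller multiples were already crossed off, so start at i*i
            for multiple in range(i * i, n, i):
                is_prime[multiple] = False
    return sum(i for i, flag in enumerate(is_prime) if flag)

print(sum_primes_below(2000000))  # 142913828922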

String manipulation appears to be inefficient

I think my code is too inefficient. I'm guessing it has something to do with using strings, though I'm unsure. Here is the code:
genome = FASTAdata[1]
genomeLength = len(genome)
# Hash table holding all the k-mers we will come across
kmers = dict()
# We go through all the possible k-mers by index
for outer in range(0, genomeLength-1):
    for inner in range(outer+2, outer+22):
        substring = genome[outer:inner]
        if substring in kmers: # if we already have this substring on record, increase its value (count of num of appearances) by 1
            kmers[substring] += 1
        else:
            kmers[substring] = 1 # otherwise record that it's here once
This is to search through all substrings of length at most 20. Now this code seems to take forever and never terminates, so something has to be wrong here. Is using [:] on strings causing the huge overhead? And if so, what can I replace it with?
And for clarity, the file in question is nearly 200 MB, so pretty big.
I would recommend using a dynamic programming algorithm. The problem is that for all inner strings that are not found, you are re-searching those again with extra characters appended onto them, so of course those will also not be found. I do not have a specific algorithm in mind, but this is certainly a case for dynamic programming where you remember what you have already searched for. As a really crummy example, remember all substrings of length 1,2,3,... that are not found, and never extend those bases in the next iteration where the strings are only longer.
You should use memoryview to avoid creating sub-strings, as [:] will then return a "view" instead of a copy, BUT you must use Python 3.3 or higher (before that, memoryviews are not hashable).
Also, a Counter will simplify your code.
from collections import Counter

genome = memoryview("abcdefghijkrhirejtvejtijvioecjtiovjitrejabababcd".encode('ascii'))
genomeLength = len(genome)
minlen, maxlen = 2, 22

def fragments():
    for start in range(0, genomeLength-minlen):
        for finish in range(start+minlen, start+maxlen):
            if finish <= genomeLength:
                yield genome[start:finish]

count = Counter(fragments())
for (mv, n) in count.most_common(3):
    print(n, mv.tobytes())
produces:
4 b'ab'
3 b'jt'
3 b'ej'
A 1,000,000 byte random array takes 45s on my laptop, but 2,000,000 causes swapping (over 8GB memory use). However, since your fragment size is small, you can easily break the problem up into million-long sub-sequences and then combine results at the end (just be careful about overlaps). That would give a total running time for a 200MB array of ~3 hours, with luck.
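A rough sketch of that chunking, reusing the names above (count_window is a hypothetical helper; the overlap of maxlen-1 bytes keeps boundary-spanning fragments from being missed):
def count_window(window):
    # same counting as fragments() above, restricted to one window
    wl = len(window)
    return Counter(window[s:f]
                   for s in range(0, wl - minlen)
                   for f in range(s + minlen, min(s + maxlen, wl + 1)))

chunk, overlap = 1000000, 21  # overlap = maxlen - 1
for start in range(0, genomeLength, chunk):
    counts = count_window(genome[start : start + chunk + overlap])
    # save counts.most_common(...) for this window, then merge at the end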
PS To be clear, by "combine results at the end" I assume that you only need to save the most popular for each 1M sub-sequence, by, for example, writing them to a file. You cannot keep the counter in memory - that is what is using the 8GB. That's fine if you have fragments that occur many thousands of times, but obviously won't work for smaller numbers (you might see a fragment just once in each of the 200 1M sub-sequences, and so never save it, for example). In other words, results will be lower bounds that are incomplete, particularly at lower frequencies (values are complete only if a fragment is found and recorded in every sub-sequence). If you need an exact result, this is not suitable.

Python loop transfer in other programming language

Again I have a question concerning large loops.
Suppose I have a function limits:
def limits(a,b):
    *evaluate integral with upper and lower limits a and b*
    return float result
A and B are simple np.arrays that store my values a and b. Now I want to calculate the integral 300'000^2/2 times, because A and B each have length 300'000 and the integral is symmetric.
In Python I tried several ways, like itertools.combinations_with_replacement, to create the combinations of A and B and then put them into the integral, but that takes a huge amount of time and the memory gets totally overloaded.
Is there any way, for example transferring the loop to another language, to speed this up?
I would like to run the loop
for i in range(len(A)):
    for j in range(len(B)):
        np.histogram(limits(A[i],B[j]))
I think histogramming the return value of limits is desirable, in order not to store additional arrays that grow quadratically.
From what I read, Python is not really the best choice for this kind of iterative ansatz.
So would it be reasonable to evaluate this loop in another language from within Python, and if so, how would I do it? I know there are ways to transfer code, but I have never done it so far.
Thanks for your help.
If you're worried about memory footprint, all you need to do is bin the results as you go in the for loop.
num_bins = 100
bin_upper_limits = np.linspace(-456, 456, num=num_bins-1)
# (last bin has no upper limit, it goes from 456 to infinity)
bin_count = np.zeros(num_bins)
for a in A:
    for b in B:
        if b < a:
            # you said the integral is symmetric, so we can skip these, right?
            continue
        new_result = limits(a,b)
        which_bin = np.digitize([new_result], bin_upper_limits)
        bin_count[which_bin] += 1
So nothing large is saved in memory.
As for speed, I imagine that the overwhelming majority of the time is spent evaluating limits(a,b). The looping and binning are plenty fast in this case, even in Python. To convince yourself of this, try replacing the line new_result = limits(a,b) with new_result = 234. You'll find that the loop runs very fast. (A few minutes on my computer, much, much less than the 4-hour figure you quote.) Python does not loop very fast compared to C, but it doesn't matter in this case.
Whatever you do to speed up the limits() call (including implementing it in another language) will speed up the program.
If you change the algorithm, there is vast room for improvement. Let's take an example of what it seems you're doing. Let's say A and B are 0,1,2,3. You're integrating a function over the ranges 0-->0, 0-->1, 1-->1, 1-->2, 0-->2, etc. etc. You're re-doing the same work over and over. If you have integrated 0-->1 and 1-->2, then you can add up those two results to get the integral 0-->2. You don't have to use a fancy integration algorithm, you just have to add two numbers you already know.
Therefore it seems to me that you can compute integrals in all the smallest ranges (0-->1, 1-->2, 2-->3), store the results in an array, and add subsets of the results to get the integral over whatever range you want. If you want this program to run in a few minutes instead of 4 hours, I suggest thinking through an alternative algorithm along those lines.
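A hedged sketch of that idea, assuming the grid points live in a sorted 1-D array and limits is your integration routine (range_integral is a hypothetical helper name):
import numpy as np

points = np.sort(A)
# integrate each smallest range (points[k] --> points[k+1]) exactly once
pieces = np.array([limits(points[k], points[k+1]) for k in range(len(points) - 1)])
# cumulative sums make any wider range a single subtraction
cum = np.concatenate(([0.0], np.cumsum(pieces)))

def range_integral(i, j):
    # integral from points[i] to points[j], by additivity of the integral
    return cum[j] - cum[i]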
(Sorry if I'm misunderstanding the problem you're trying to solve.)
