This is mostly an exercise in learning Python. I wrote this function to test if a number is prime:
def p1(n):
for d in xrange(2, int(math.sqrt(n)) + 1):
if n % d == 0:
return False
return True
Then I realized I can make easily rewrite it using any():
def p2(n):
return not any((n % d == 0) for d in xrange(2, int(math.sqrt(n)) + 1))
Performance-wise, I was expecting p2 to be faster than, or at the very least as fast as, p1 because any() is builtin, but for a large-ish prime, p2 is quite a bit slower:
$ python -m timeit -n 100000 -s "import test" "test.p1(999983)"
100000 loops, best of 3: 60.2 usec per loop
$ python -m timeit -n 100000 -s "import test" "test.p2(999983)"
100000 loops, best of 3: 88.1 usec per loop
Am I using any() incorrectly here? Is there a way to write this function using any() so that it's as far as iterating myself?
Update: Numbers for an even larger prime
$ python -m timeit -n 1000 -s "import test" "test.p1(9999999999971)"
1000 loops, best of 3: 181 msec per loop
$ python -m timeit -n 1000 -s "import test" "test.p2(9999999999971)"
1000 loops, best of 3: 261 msec per loop
The performance difference is minimal, but the reason it exists is that any incurs building a generator expression, and an extra function call, compared to the for loop. Both have identical behaviors, though (shortcut evaluation).
As the size of your input grows, the difference won't diminish (I was wrong) because you're using a generator expression, and iterating over it requires calling a method (.next()) on it and an extra stack frame. any does that under the hood, of course.
The for loop is iterating over an xrange object. any is iterating over a generator expression, which itself is iterating over an xrange object.
Either way, use whichever produces the most readable/maintainable code. Choosing one over the other will have little, if any, performance impact on whatever program you're writing.
Related
Imagine list of strings like this one: ('{hello world} is awesome', 'Hello world is less awesome', '{hello world} is {awesome} too'). I want to check each string in for cycle for starting character, I think I have got 4 options:
if re.search(r'^\{', i):
if re.match(r'\{', i):
if i.startswith('{'):
if i[:1] == '{':
Which is the fastest one? Is there some even more faster than these 4 options?
Note: The starting string to compare could be longer, not only one letter, e.g. {hello
The fastest is i[0] == value, since it directly uses a pointer to the underlying array. Regex needs to (at least) parse the pattern, while startsWith has the overhead of a method call and creating a slice of that size before the actual comparison.
As #dsqdfg said in the comments, there is a timing function in python I've never known until now. I tried to measure them and there are some results:
python -m timeit -s 'text="{hello world}"' 'text[:6] == "{hello"'
1000000 loops, best of 3: 0.224 usec per loop
python -m timeit -s 'text="{hello world}"' 'text.startswith("{hello")'
1000000 loops, best of 3: 0.291 usec per loop
python -m timeit -s 'text="{hello world}"' 'import re' 're.match(r"\{hello", text)'
100000 loops, best of 3: 2.53 usec per loop
python -m timeit -s 'text="{hello world}"' 'import re' 're.search(r"^\{hello", text)'
100000 loops, best of 3: 2.86 usec per loop
I have a list, but the numbers in it are strings so I can't find the sum of the list, so I need help in converting the numbers in the list to int.
This is my code
def convertStr(cals):
ret = float(cals)
return ret
TotalCal = sum(cals)
So basically there is list called cals
and it looks like this
(20,45,...etc)
But the numbers in it are strings so when I try finding the sum like this
TotalCal = sum(cals)
And then run it shows an error saying that the list needs to be an int format
so the question is how do I convert all numbers in the list to int format?
If you have a different way of finding the sum of lists it will be good too.
You can use either the python builtin map or a list comprehension for this
def convertStr(cals):
ret = [float(i) for i in (cals)]
return ret
or
def convertStr(cals):
return map(float,cals)
Here are the timeit results for both the approaches
$ python -m timeit "cals = ['1','2','3','4'];[float(i) for i in (cals)]"
1000000 loops, best of 3: 0.804 usec per loop
$ python -m timeit "cals = ['1','2','3','4'];map(float,cals)"
1000000 loops, best of 3: 0.787 usec per loop
As you can see map is faster and more pythonic as compared to the list comprehension. This is discussed in full length here
map may be microscopically faster in some cases (when you're NOT making a lambda for the purpose, but using the same function in map and a listcomp). List comprehensions may be faster in other cases
Another way using itertools.imap. This is the fastest for long lists
from itertools import imap
TotalCal = sum(imap(float,cals)
And using timeit for a list with 1000 entries.
$ python -m timeit "import random;cals = [str(random.randint(0,100)) for r in range(1000)];sum(map(float,cals))"
1000 loops, best of 3: 1.38 msec per loop
$ python -m timeit "import random;cals = [str(random.randint(0,100)) for r in range(1000)];[float(i) for i in (cals)]"
1000 loops, best of 3: 1.39 msec per loop
$ python -m timeit "from itertools import imap;import random;cals = [str(random.randint(0,100)) for r in range(1000)];imap(float,cals)"
1000 loops, best of 3: 1.24 msec per loop
As Padraic mentions below, The imap way is the best way to go! It is fast1 and looks great! Inclusion of a library function has it's bearing on small lists only and not on large lists. Thus for large lists, imap is better suited.
1 List comprehension is still slower than map by 1 micro second!!! Thank god
sum(map(float,cals))
or
sum(float(i) for i in cals)
Is there a big difference in speed in these two code fragments?
1.
x = set( i for i in data )
versus:
2.
x = set( [ i for i in data ] )
I've seen people recommending set() instead of set([]); is this just a matter of style?
The form
x = set(i for i in data)
is shorthand for:
x = set((i for i in data))
This creates a generator expression which evaluates lazily. Compared to:
x = set([i for i in data])
which creates an entire list before passing it to set
From a performance standpoint, generator expressions allow for short-circuiting in certain functions (all and any come to mind) and takes less memory as you don't need to store the extra list -- In some cases this can be very significant.
If you actually are going to iterate over the entire iterable data, and memory isn't a problem for you, I've found that typically the list-comprehension is slightly faster then the equivalent generator expression*.
temp $ python -m timeit 'set(i for i in "xyzzfoobarbaz")'
100000 loops, best of 3: 3.55 usec per loop
temp $ python -m timeit 'set([i for i in "xyzzfoobarbaz"])'
100000 loops, best of 3: 3.42 usec per loop
Note that if you're curious about speed -- Your fastest bet will probably be just:
x = set(data)
proof:
temp $ python -m timeit 'set("xyzzfoobarbaz")'
1000000 loops, best of 3: 1.83 usec per loop
*Cpython only -- I don't know how Jython or pypy optimize this stuff.
The [] syntax creates a list, which is discarded immediatley after the set is created. So you are increasing the memory footprint of the program.
The generator syntax avoids that.
From this site, it says that list.index() is a linear search through the list.
And it also seems like in is also linear.
Is there any advantage to using one over the other?
If you want to compare different python approaches, such as the in operator versus .index(), use the timeit module to test the speed differences. Python data type complexities are documented on http://wiki.python.org/moin/TimeComplexity.
Do note that there is a big difference between in and .index(); the first one returns a boolean, the latter the index of the found item (an integer) or it'll raise an exception. It thus is (slightly) slower for the average case:
$ python -mtimeit -s 'a = list(range(10000))' '5000 in a'
10000 loops, best of 3: 107 usec per loop
$ python -mtimeit -s 'a = list(range(10000))' 'a.index(5000)'
10000 loops, best of 3: 111 usec per loop
If you need to optimize for membership testing, use a set() instead:
$ python -mtimeit -s 'a = set(range(10000))' '5000 in a'
10000000 loops, best of 3: 0.108 usec per loop
I have a generator that generates a finite sequence. To determine
the length of this sequence I tried these two approaches:
seq_len = sum([1 for _ in euler14_seq(sv)]) # list comp
and
seq_len = sum(1 for _ in euler14_seq(sv)) # generator expression
sv is a constant starting value for the sequence.
I had expected that list comprehension would be slower and the
generator expression faster, but it turns out the other way around.
I assume the first one will be much more memory intensive since it
creates a complete list in memory first - part of the reason I also thought it would be slower.
My question: Is this observation generalizable? And is this due to
having two generators involved in the second statement vs the first?
I've looked at these What's the shortest way to count the number of items in a generator/iterator?, Length of generator output, and
Is there any built-in way to get the length of an iterable in python? and saw some other approaches to measuring the length of a sequence, but I'm specifically curious about the comparison of list comp vs generator expression.
PS: This came up when I decided to solve Euler Project #14 based on a
question asked on SO yesterday.
(By the way, what's the general feeling regarding use of the '_' in
places where variable values are not needed).
This was done with Python 2.7.2 (32-bit) under Windows 7 64-bit
On this computer, the generator expression becomes faster somewhere between 100,000 and 1,000,000
$ python -m timeit "sum(1 for x in xrange(100000))"
10 loops, best of 3: 34.8 msec per loop
$ python -m timeit "sum([1 for x in xrange(100000)])"
10 loops, best of 3: 20.8 msec per loop
$ python -m timeit "sum(1 for x in xrange(1000000))"
10 loops, best of 3: 315 msec per loop
$ python -m timeit "sum([1 for x in xrange(1000000)])"
10 loops, best of 3: 469 msec per loop
The following code block should generate the length:
>>> gen1 = (x for x in range(10))
>>> len(list(gen1))
10