list comprehension and map without lambda on long string - python

$ python -m timeit -s'tes = "987kkv45kk321"*100' 'a = [list(i) for i in tes.split("kk")]'
10000 loops, best of 3: 79.4 usec per loop
$ python -m timeit -s'tes = "987kkv45kk321"*100' 'b = list(map(list, tes.split("kk")))'
10000 loops, best of 3: 66.9 usec per loop
$ python -m timeit -s'tes = "987kkv45kk321"*10' 'a = [list(i) for i in tes.split("kk")]'
100000 loops, best of 3: 8.34 usec per loop
$ python -m timeit -s'tes = "987kkv45kk321"*10' 'b = list(map(list, tes.split("kk")))'
100000 loops, best of 3: 7.38 usec per loop
$ python -m timeit -s'tes = "987kkv45kk321"' 'a = [list(i) for i in tes.split("kk")]'
1000000 loops, best of 3: 1.51 usec per loop
$ python -m timeit -s'tes = "987kkv45kk321"' 'b = list(map(list, tes.split("kk")))'
1000000 loops, best of 3: 1.63 usec per loop
I tried using timeit, and I wonder why creating a list of lists from str.split() with a list comprehension is faster for a shorter string but slower for a longer string.

The fixed setup costs for map are higher than the setup costs for the listcomp solution. But the per-item costs for map are lower. So for short inputs, map is paying more in fixed setup costs than it saves on the per item costs (because there are so few items). When the number of items increases, the fixed setup costs for map don't change, but the savings per item is being reaped for more items, so map slowly pulls ahead.
Things that map saves on:
Only looks up list once (the listcomp has to look it up in the builtin namespace every single loop, after checking the nested and global scopes first, because it can't guarantee list isn't overridden from loop to loop)
Executes no Python bytecode per item (because the mapping function is also C level), so the interpreter doesn't get involved at all, reducing the amount of hot C level code
map loses on the actual call to map (C built-in functions are fast to run, but comparatively slow to call, especially if they take variable length arguments), and the creation and cleanup of the map object (the listcomp closure is compiled up front). But as I noted above, neither of these is tied to the size of the inputs, so you make up for it rapidly if the mapping function is a C builtin.
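A minimal sketch making the comparison concrete (using the sample string from the timings above); both forms produce identical output:

```python
# Split on "kk", then explode each piece into a list of characters.
tes = "987kkv45kk321"

a = [list(i) for i in tes.split("kk")]   # listcomp: interpreter bytecode runs per item
b = list(map(list, tes.split("kk")))     # map: per-item work stays at C level

assert a == b == [['9', '8', '7'], ['v', '4', '5'], ['3', '2', '1']]
```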

This kind of timing is basically useless.
The time frames you are getting are in microseconds - and you are just creating tens of one-character-element lists in each iteration. You get basically linear time, because the number of objects you create is proportional to your string length. There is hardly any surprise in this.

Related

Converting str numbers in list to int and find out the sum of the list

I have a list, but the numbers in it are strings, so I can't find the sum of the list; I need help converting the numbers in the list to int.
This is my code
def convertStr(cals):
    ret = float(cals)
    return ret

TotalCal = sum(cals)
So basically there is a list called cals
and it looks like this
(20,45,...etc)
But the numbers in it are strings, so when I try finding the sum like this
TotalCal = sum(cals)
and then run it, it shows an error saying that the list needs to be in int format.
So the question is: how do I convert all the numbers in the list to int?
If you have a different way of finding the sum of the list, that would be good too.
You can use either the python builtin map or a list comprehension for this
def convertStr(cals):
    ret = [float(i) for i in cals]
    return ret
or
def convertStr(cals):
    return map(float, cals)
Here are the timeit results for both the approaches
$ python -m timeit "cals = ['1','2','3','4'];[float(i) for i in (cals)]"
1000000 loops, best of 3: 0.804 usec per loop
$ python -m timeit "cals = ['1','2','3','4'];map(float,cals)"
1000000 loops, best of 3: 0.787 usec per loop
As you can see, map comes out marginally faster than the list comprehension here. This is discussed at full length here:
map may be microscopically faster in some cases (when you're NOT making a lambda for the purpose, but using the same function in map and a listcomp). List comprehensions may be faster in other cases
Another way is to use itertools.imap. This is the fastest for long lists:
from itertools import imap
TotalCal = sum(imap(float, cals))
And using timeit for a list with 1000 entries.
$ python -m timeit "import random;cals = [str(random.randint(0,100)) for r in range(1000)];sum(map(float,cals))"
1000 loops, best of 3: 1.38 msec per loop
$ python -m timeit "import random;cals = [str(random.randint(0,100)) for r in range(1000)];[float(i) for i in (cals)]"
1000 loops, best of 3: 1.39 msec per loop
$ python -m timeit "from itertools import imap;import random;cals = [str(random.randint(0,100)) for r in range(1000)];imap(float,cals)"
1000 loops, best of 3: 1.24 msec per loop
As Padraic mentions below, the imap way is the best way to go! It is fast1 and looks great! (Note, though, that the imap timing above never actually consumes the iterator, since imap is lazy; to be strictly comparable it should be wrapped in sum() like the map version.) The cost of importing a library function weighs only on small lists, not on large ones; thus for large lists, imap is better suited.
1 The list comprehension is still slower than map, by one microsecond! Thank god
sum(map(float,cals))
or
sum(float(i) for i in cals)
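In Python 3, map is itself lazy (imap is gone), so the map-based spelling covers both cases. A quick sketch with made-up example values:

```python
cals = ['20', '45', '35']  # illustrative data; the real list comes from the program

# map is lazy in Python 3: sum() consumes it with no intermediate list
TotalCal = sum(map(float, cals))
assert TotalCal == 100.0
```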

In python, will converting variables to its own type waste CPU power?

I'm trying to fetch an int id number from database and some ids are mistakenly stored as string, I'm wondering which of the following way is better:
# the first method
new_id = int(old_id) + 1

# second
if isinstance(old_id, str):
    new_id = int(old_id) + 1
else:
    new_id = old_id + 1
So the question is, does it cost to convert a variable to its own type in python?
Let's check!
~/Coding > python -m timeit -s "id1=1;id2='1'" "new_id = int(id1)" "new_id = int(id2)"
1000000 loops, best of 3: 0.755 usec per loop
~/Coding > python -m timeit -s "id1=1;id2='1';f=lambda x: int(x) if isinstance(x, str) else x" "new_id=f(id1)" "new_id=f(id2)"
1000000 loops, best of 3: 1.15 usec per loop
Looks like the most efficient way is simply doing the int conversion without checking.
I'm open to being corrected that the issue here is the lambda or something else I did.
Update:
This may actually not be a fair answer, because the if check itself is much quicker than the type conversion.
~/Coding > python -m timeit "int('3')"
1000000 loops, best of 3: 0.562 usec per loop
~/Coding > python -m timeit "int(3)"
10000000 loops, best of 3: 0.136 usec per loop
~/Coding > python -m timeit "if isinstance('3', str): pass"
10000000 loops, best of 3: 0.0966 usec per loop
This means that it depends on how many of your ids you expect to be strings to see which is worth it.
Update 2:
I've gone a bit overboard here, but we can determine exactly when it's right to switch over using the above timings depending on how many strings you expect to have.
Where z is the total number of ids and s is the percentage of them that are strings, and all values in microseconds,
Always check type: (assuming returning int costs 0 time)
.0966*z + .562*z*s
Always convert without checking:
.136*z*(1-s) + .562*z*s
When we do the math, the z's and string conversions cancel out (since you have to convert the string regardless), and we end up with the following:
s ~= 0.289706
So it looks like 29% strings or so is about the time when you'd cross over from one method to the other.
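The break-even fraction can be reproduced directly from the measured timings (a small sketch; the constants are the microsecond figures above):

```python
# Per-id costs measured above, in microseconds
CHECK = 0.0966        # isinstance('3', str)
CONVERT_INT = 0.136   # int(3): converting an int to int
CONVERT_STR = 0.562   # int('3'): converting a string to int

# With a fraction s of string ids:
#   always check:   CHECK + s * CONVERT_STR           (returning an int assumed free)
#   always convert: (1 - s) * CONVERT_INT + s * CONVERT_STR
# The string-conversion terms cancel, leaving CHECK = (1 - s) * CONVERT_INT:
s = 1 - CHECK / CONVERT_INT
print(round(s, 6))  # ~0.289706
```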

What is the speed difference between Python's set() and set([])?

Is there a big difference in speed in these two code fragments?
1.
x = set( i for i in data )
versus:
2.
x = set( [ i for i in data ] )
I've seen people recommending set() instead of set([]); is this just a matter of style?
The form
x = set(i for i in data)
is shorthand for:
x = set((i for i in data))
This creates a generator expression which evaluates lazily. Compared to:
x = set([i for i in data])
which creates an entire list before passing it to set.
From a performance standpoint, generator expressions allow for short-circuiting in certain functions (all and any come to mind) and take less memory, as you don't need to store the extra list -- in some cases this can be very significant.
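For instance, any() can stop consuming a generator expression at the first truthy result, while the list form evaluates every element up front. A small illustrative sketch:

```python
seen_gen, seen_list = [], []

def is_big(n, seen):
    """Record that n was evaluated, then test it."""
    seen.append(n)
    return n > 2

# generator expression: any() stops at the first True
any(is_big(n, seen_gen) for n in range(10))
# list comprehension: the whole list is built before any() sees it
any([is_big(n, seen_list) for n in range(10)])

assert seen_gen == [0, 1, 2, 3]      # short-circuited
assert seen_list == list(range(10))  # full pass
```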
If you actually are going to iterate over the entire iterable data, and memory isn't a problem for you, I've found that typically the list comprehension is slightly faster than the equivalent generator expression*.
temp $ python -m timeit 'set(i for i in "xyzzfoobarbaz")'
100000 loops, best of 3: 3.55 usec per loop
temp $ python -m timeit 'set([i for i in "xyzzfoobarbaz"])'
100000 loops, best of 3: 3.42 usec per loop
Note that if you're curious about speed -- your fastest bet will probably be just:
x = set(data)
proof:
temp $ python -m timeit 'set("xyzzfoobarbaz")'
1000000 loops, best of 3: 1.83 usec per loop
*Cpython only -- I don't know how Jython or pypy optimize this stuff.
The [] syntax creates a list, which is discarded immediately after the set is created. So you are increasing the memory footprint of the program.
The generator syntax avoids that.

What is faster for searching items in list, in operator or index()?

From this site, it says that list.index() is a linear search through the list.
And it also seems like in is also linear.
Is there any advantage to using one over the other?
If you want to compare different python approaches, such as the in operator versus .index(), use the timeit module to test the speed differences. Python data type complexities are documented on http://wiki.python.org/moin/TimeComplexity.
Do note that there is a big difference between in and .index(); the first returns a boolean, the latter returns the index of the found item (an integer), or raises an exception if the item is missing. .index() is thus (slightly) slower in the average case:
$ python -mtimeit -s 'a = list(range(10000))' '5000 in a'
10000 loops, best of 3: 107 usec per loop
$ python -mtimeit -s 'a = list(range(10000))' 'a.index(5000)'
10000 loops, best of 3: 111 usec per loop
If you need to optimize for membership testing, use a set() instead:
$ python -mtimeit -s 'a = set(range(10000))' '5000 in a'
10000000 loops, best of 3: 0.108 usec per loop
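A practical consequence of the boolean-vs-index distinction: in never raises, while .index() must be guarded when the item may be absent. A small sketch:

```python
a = list(range(10000))

# membership test: just a boolean, no exception handling needed
assert 5000 in a and 99999 not in a

# .index() gives the position, but raises ValueError on a miss
assert a.index(5000) == 5000
try:
    a.index(99999)
except ValueError:
    pass  # absence must be handled explicitly

# for repeated membership tests, a set gives O(1) average lookups
s = set(a)
assert 5000 in s
```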

How to measure length of generator sequence (list comp vs generator expression)

I have a generator that generates a finite sequence. To determine
the length of this sequence I tried these two approaches:
seq_len = sum([1 for _ in euler14_seq(sv)]) # list comp
and
seq_len = sum(1 for _ in euler14_seq(sv)) # generator expression
sv is a constant starting value for the sequence.
I had expected that the list comprehension would be slower and the
generator expression faster, but it turns out the other way around.
I assume the first one will be much more memory intensive since it
creates a complete list in memory first - part of the reason I also thought it would be slower.
My question: Is this observation generalizable? And is this due to
having two generators involved in the second statement vs the first?
I've looked at these What's the shortest way to count the number of items in a generator/iterator?, Length of generator output, and
Is there any built-in way to get the length of an iterable in python? and saw some other approaches to measuring the length of a sequence, but I'm specifically curious about the comparison of list comp vs generator expression.
PS: This came up when I decided to solve Euler Project #14 based on a
question asked on SO yesterday.
(By the way, what's the general feeling regarding use of the '_' in
places where variable values are not needed).
This was done with Python 2.7.2 (32-bit) under Windows 7 64-bit
On this computer, the generator expression becomes faster somewhere between 100,000 and 1,000,000 items:
$ python -m timeit "sum(1 for x in xrange(100000))"
10 loops, best of 3: 34.8 msec per loop
$ python -m timeit "sum([1 for x in xrange(100000)])"
10 loops, best of 3: 20.8 msec per loop
$ python -m timeit "sum(1 for x in xrange(1000000))"
10 loops, best of 3: 315 msec per loop
$ python -m timeit "sum([1 for x in xrange(1000000)])"
10 loops, best of 3: 469 msec per loop
The following code block should generate the length:
>>> gen1 = (x for x in range(10))
>>> len(list(gen1))
10
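Note that len(list(gen1)) materializes the whole sequence in memory first; the generator-expression form from the question counts without building a list, at the cost of exhausting the iterator:

```python
gen1 = (x for x in range(10))

# count one item at a time; no intermediate list is built
seq_len = sum(1 for _ in gen1)
assert seq_len == 10

# the generator is now exhausted, so counting again yields 0
assert sum(1 for _ in gen1) == 0
```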
