I'm curious why pandas.Series.div() is slower than /= when applying to a pandas Series of numbers. For example:
python3 -m timeit -s 'import pandas as pd; ser = pd.Series(list(range(99999)))' 'ser /= 7'
1000 loops, best of 3: 584 usec per loop
python3 -m timeit -s 'import pandas as pd; ser = pd.Series(list(range(99999)))' 'ser.div(7)'
1000 loops, best of 3: 746 usec per loop
I assume that it's because the former changes the series in place whereas the latter returns a new Series. But if that's the case, then why bother implementing div() and mul() at all if they're not as fast as /= and */?
Even if you don't want to change the series in place, ser / 7 is still faster than .div():
python3 -m timeit -s 'import pandas as pd; ser = pd.Series(list(range(99999)))' 'ser / 7'
1000 loops, best of 3: 656 usec per loop
So what is the use of pd.Series.div() and what about it makes it slower?
Pandas .div obviously implement division similarly to / and /=.
The main reason to have a separate .div is that Pandas embraces a syntax model where operations on dataframes are described by the applications of consecutive filters, e.g. .div, .str, etc. which allows for simple concatenations:
ser.div(7).apply(lambda x: 'text: ' + str(x)).str.upper()
as well as simpler support for multiple arguments (cfr. .func(a, b, c) would not be possible to write with a binary operator).
By contrast, the same would have been written without div as:
(ser / 7).apply(lambda x: 'text: ' + str(x)).str.upper()
The / operation may be faster because there is less Python overhead associated with / operator compared to .div().
By contrast, the x /= y operator replaces the construct x = x / y.
For vectorized containers based on NumPy (like Pandas), it goes a little beyond that: it uses an in-place operation instead of creating a (potentially time- and memory-consuming) copy of x. This is the reason why /= is faster than both / and .div().
Note that, while in most cases this is equivalent, sometimes (like in this case) it may still require conversion to a different data type, which is done automatically in Pandas (but not in NumPy).
Related
I have a list, but the numbers in it are strings so I can't find the sum of the list, so I need help in converting the numbers in the list to int.
This is my code
def convertStr(cals):
ret = float(cals)
return ret
TotalCal = sum(cals)
So basically there is list called cals
and it looks like this
(20,45,...etc)
But the numbers in it are strings so when I try finding the sum like this
TotalCal = sum(cals)
And then run it shows an error saying that the list needs to be an int format
so the question is how do I convert all numbers in the list to int format?
If you have a different way of finding the sum of lists it will be good too.
You can use either the python builtin map or a list comprehension for this
def convertStr(cals):
ret = [float(i) for i in (cals)]
return ret
or
def convertStr(cals):
return map(float,cals)
Here are the timeit results for both the approaches
$ python -m timeit "cals = ['1','2','3','4'];[float(i) for i in (cals)]"
1000000 loops, best of 3: 0.804 usec per loop
$ python -m timeit "cals = ['1','2','3','4'];map(float,cals)"
1000000 loops, best of 3: 0.787 usec per loop
As you can see map is faster and more pythonic as compared to the list comprehension. This is discussed in full length here
map may be microscopically faster in some cases (when you're NOT making a lambda for the purpose, but using the same function in map and a listcomp). List comprehensions may be faster in other cases
Another way using itertools.imap. This is the fastest for long lists
from itertools import imap
TotalCal = sum(imap(float,cals)
And using timeit for a list with 1000 entries.
$ python -m timeit "import random;cals = [str(random.randint(0,100)) for r in range(1000)];sum(map(float,cals))"
1000 loops, best of 3: 1.38 msec per loop
$ python -m timeit "import random;cals = [str(random.randint(0,100)) for r in range(1000)];[float(i) for i in (cals)]"
1000 loops, best of 3: 1.39 msec per loop
$ python -m timeit "from itertools import imap;import random;cals = [str(random.randint(0,100)) for r in range(1000)];imap(float,cals)"
1000 loops, best of 3: 1.24 msec per loop
As Padraic mentions below, The imap way is the best way to go! It is fast1 and looks great! Inclusion of a library function has it's bearing on small lists only and not on large lists. Thus for large lists, imap is better suited.
1 List comprehension is still slower than map by 1 micro second!!! Thank god
sum(map(float,cals))
or
sum(float(i) for i in cals)
Is there a big difference in speed in these two code fragments?
1.
x = set( i for i in data )
versus:
2.
x = set( [ i for i in data ] )
I've seen people recommending set() instead of set([]); is this just a matter of style?
The form
x = set(i for i in data)
is shorthand for:
x = set((i for i in data))
This creates a generator expression which evaluates lazily. Compared to:
x = set([i for i in data])
which creates an entire list before passing it to set
From a performance standpoint, generator expressions allow for short-circuiting in certain functions (all and any come to mind) and takes less memory as you don't need to store the extra list -- In some cases this can be very significant.
If you actually are going to iterate over the entire iterable data, and memory isn't a problem for you, I've found that typically the list-comprehension is slightly faster then the equivalent generator expression*.
temp $ python -m timeit 'set(i for i in "xyzzfoobarbaz")'
100000 loops, best of 3: 3.55 usec per loop
temp $ python -m timeit 'set([i for i in "xyzzfoobarbaz"])'
100000 loops, best of 3: 3.42 usec per loop
Note that if you're curious about speed -- Your fastest bet will probably be just:
x = set(data)
proof:
temp $ python -m timeit 'set("xyzzfoobarbaz")'
1000000 loops, best of 3: 1.83 usec per loop
*Cpython only -- I don't know how Jython or pypy optimize this stuff.
The [] syntax creates a list, which is discarded immediatley after the set is created. So you are increasing the memory footprint of the program.
The generator syntax avoids that.
I have a generator that generates a finite sequence. To determine
the length of this sequence I tried these two approaches:
seq_len = sum([1 for _ in euler14_seq(sv)]) # list comp
and
seq_len = sum(1 for _ in euler14_seq(sv)) # generator expression
sv is a constant starting value for the sequence.
I had expected that list comprehension would be slower and the
generator expression faster, but it turns out the other way around.
I assume the first one will be much more memory intensive since it
creates a complete list in memory first - part of the reason I also thought it would be slower.
My question: Is this observation generalizable? And is this due to
having two generators involved in the second statement vs the first?
I've looked at these What's the shortest way to count the number of items in a generator/iterator?, Length of generator output, and
Is there any built-in way to get the length of an iterable in python? and saw some other approaches to measuring the length of a sequence, but I'm specifically curious about the comparison of list comp vs generator expression.
PS: This came up when I decided to solve Euler Project #14 based on a
question asked on SO yesterday.
(By the way, what's the general feeling regarding use of the '_' in
places where variable values are not needed).
This was done with Python 2.7.2 (32-bit) under Windows 7 64-bit
On this computer, the generator expression becomes faster somewhere between 100,000 and 1,000,000
$ python -m timeit "sum(1 for x in xrange(100000))"
10 loops, best of 3: 34.8 msec per loop
$ python -m timeit "sum([1 for x in xrange(100000)])"
10 loops, best of 3: 20.8 msec per loop
$ python -m timeit "sum(1 for x in xrange(1000000))"
10 loops, best of 3: 315 msec per loop
$ python -m timeit "sum([1 for x in xrange(1000000)])"
10 loops, best of 3: 469 msec per loop
The following code block should generate the length:
>>> gen1 = (x for x in range(10))
>>> len(list(gen1))
10
I'm confused about when numpy's numpy.apply_along_axis() function will outperform a simple Python loop. For example, consider the case of a matrix with many rows, and you wish to compute the sum of each row:
x = np.ones([100000, 3])
sums1 = np.array([np.sum(x[i,:]) for i in range(x.shape[0])])
sums2 = np.apply_along_axis(np.sum, 1, x)
Here I am even using a built-in numpy function, np.sum, and yet calculating sums1 (Python loop) takes less than 400ms while calculating sums2 (apply_along_axis) takes over 2000ms (NumPy 1.6.1 on Windows). By further way of comparison, R's rowMeans function can often do this in less than 20ms (I'm pretty sure it's calling C code) while the similar R function apply() can do it in about 600ms.
np.sum take an axis parameter, so you could compute the sum simply using
sums3 = np.sum(x, axis=1)
This is much faster than the 2 methods you posed.
$ python -m timeit -n 1 -r 1 -s "import numpy as np;x=np.ones([100000,3])" "np.apply_along_axis(np.sum, 1, x)"
1 loops, best of 1: 3.21 sec per loop
$ python -m timeit -n 1 -r 1 -s "import numpy as np;x=np.ones([100000,3])" "np.array([np.sum(x[i,:]) for i in range(x.shape[0])])"
1 loops, best of 1: 712 msec per loop
$ python -m timeit -n 1 -r 1 -s "import numpy as np;x=np.ones([100000,3])" "np.sum(x, axis=1)"
1 loops, best of 1: 1.81 msec per loop
(As for why apply_along_axis is slower — I don't know, probably because the function is written in pure Python and is much more generic and thus less optimization opportunity than the array version.)
I am just getting to know numpy, and I am impressed by its claims of C-like efficiency with memory access in its ndarrays. I wanted to see the differences between these and pythonic lists for myself, so I ran a quick timing test, performing a few of the same simple tasks with numpy without it. Numpy outclassed regular lists by an order of magnitude in the allocation of and arithmetic operations on arrays, as expected. But this segment of code, identical in both tests, took about 1/8 of a second with a regular list, and slightly over 2.5 seconds with numpy:
file = open('timing.log','w')
for num in a2:
if num % 1000 == 0:
file.write("Multiple of 1000!\r\n")
file.close()
Does anyone know why this might be, and if there is some other syntax i should be using for operations like this to take better advantage of what the ndarray can do?
Thanks...
EDIT: To answer Wayne's comment... I timed them both repeatedly and in different orders and got pretty much identical results each time, so I doubt it's another process. I put start = time() at the top of the file after the numpy import and then I have statements like print 'Time after traversal:\t',(time() - start) throughout.
a2 is a NumPy array, right? One possible reason it might be taking so long in NumPy (if other processes' activity don't account for it as Wayne Werner suggested) is that you're iterating over the array using a Python loop. At every step of the iteration, Python has to fetch a single value out of the NumPy array and convert it to a Python integer, which is not a particularly fast operation.
NumPy works much better when you are able to perform operations on the whole array as a unit. In your case, one option (maybe not even the fastest) would be
file.write("Multiple of 1000!\r\n" * (a2 % 1000 == 0).sum())
Try comparing that to the pure-Python equivalent,
file.write("Multiple of 1000!\r\n" * sum(filter(lambda i: i % 1000 == 0, a2)))
or
file.write("Multiple of 1000!\r\n" * sum(1 for i in a2 if i % 1000 == 0))
I'm not surprised that NumPy does poorly w/r/t Python built-ins when using your snippet. A large fraction of the performance benefit in NumPy arises from avoiding the loops and instead access the array by indexing:
In NumPy, it's more common to do something like this:
A = NP.random.randint(10, 100, 100).reshape(10, 10)
w = A[A % 2 == 0]
NP.save("test_file.npy", w)
Per-element access is very slow for numpy arrays. Use vector operations:
$ python -mtimeit -s 'import numpy as np; a2=np.arange(10**6)' '
> sum(1 for i in a2 if i % 1000 == 0)'
10 loops, best of 3: 1.53 sec per loop
$ python -mtimeit -s 'import numpy as np; a2=np.arange(10**6)' '
> (a2 % 1000 == 0).sum()'
10 loops, best of 3: 22.6 msec per loop
$ python -mtimeit -s 'import numpy as np; a2= range(10**6)' '
> sum(1 for i in a2 if i % 1000 == 0)'
10 loops, best of 3: 90.9 msec per loop