I have written code for evaluating a polynomial using three different methods. Horner's method should be the fastest and the naive method the slowest, right? So why are the measured times not what I expect? The times sometimes even turn out to be exactly the same for the itera and naive methods. What's wrong?
import numpy.random as npr
import time

def Horner(c,x):
    p=0
    for i in c[-1::-1]:
        p = p*x+i
    return p

def naive(c,x):
    n = len(c)
    p = 0
    for i in range(len(c)):
        p += c[i]*x**i
    return p

def itera(c,x):
    p = 0
    xi = 1
    for i in range(len(c)):
        p += c[i]*xi
        xi *= x
    return p

c=npr.uniform(size=(500,1))
x=-1.34

start_time=time.time()
print Horner(c,x)
print time.time()-start_time

start_time=time.time()
print itera(c,x)
print time.time()-start_time

start_time=time.time()
print naive(c,x)
print time.time()-start_time
Here are some of the results:
[ 2.58646959e+69]
0.00699996948242
[ 2.58646959e+69]
0.00600004196167
[ 2.58646959e+69]
0.00600004196167
[ -3.30717922e+69]
0.00899982452393
[ -3.30717922e+69]
0.00600004196167
[ -3.30717922e+69]
0.00600004196167
[ -2.83469309e+69]
0.00999999046326
[ -2.83469309e+69]
0.00999999046326
[ -2.83469309e+69]
0.0120000839233
Your profiling can be much improved. Plus, we can make your code run 200-500x faster.
(1) Rinse and repeat
You can't run just one iteration of a performance test, for two reasons.
Your time resolution might not be good enough. This is why you sometimes got the same time for two implementations: the time for one run was near the resolution of your timing mechanism, so you recorded only one "tick".
There are all sorts of factors that affect performance. Your best bet for a meaningful comparison will be a lot of iterations.
You don't need gazillions of runs (though, of course, that doesn't hurt), but you estimate and adjust the number of iterations until the variance is within a level acceptable to your purpose.
timeit is a nice little module for profiling Python code.
I added this to bottom of your script.
import timeit
n = 1000
print 'Horner', timeit.timeit(
    number = n,
    setup='from __main__ import Horner, c, x',
    stmt='Horner(c,x)'
)
print 'naive', timeit.timeit(
    number = n,
    setup='from __main__ import naive, c, x',
    stmt='naive(c,x)',
)
print 'itera', timeit.timeit(
    number = n,
    setup='from __main__ import itera, c, x',
    stmt='itera(c,x)',
)
Which produces
Horner 1.8656351566314697
naive 2.2408010959625244
itera 1.9751169681549072
Horner is the fastest, but it's not exactly blowing the doors off the other two.
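As a rough sketch of the "repeat until the variance is acceptable" idea, timeit.repeat returns one total per repetition, so the spread between repetitions is visible directly. This uses Python 3 syntax and a callable instead of a statement string; the function name horner and the coefficient values are placeholders, not taken from the script above.

```python
import timeit

def horner(c, x):
    """Evaluate c[0] + c[1]*x + c[2]*x**2 + ... with Horner's rule."""
    p = 0.0
    for coeff in reversed(c):
        p = p * x + coeff
    return p

c = [0.5, -1.0, 2.0, 0.25]  # arbitrary example coefficients
x = -1.34

# Five independent measurements of 1000 calls each.
runs = timeit.repeat(lambda: horner(c, x), repeat=5, number=1000)

best = min(runs)            # the run least disturbed by other activity
spread = max(runs) - best   # rough indication of run-to-run noise
print("best: %.6f s, spread: %.6f s" % (best, spread))
```

If the spread is large compared to the difference between two implementations, raising number or repeat is the usual next step.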
(2) Look at what is happening...very carefully
Python has operator overloading, so it's easy to miss seeing this.
npr.uniform(size=(500,1)) is giving you a 500 x 1 numpy structure of random numbers.
So what?
Well, c[i] isn't a number. It's a numpy array with one element. Numpy overloads the operators so you can do things like multiply an array by a scalar.
That's fine, but using an array for every element is a lot of overhead, so it's harder to see the difference between the algorithms.
Instead, let's try a simple Python list:
import random
c = [random.random() for _ in range(500)]
And now,
Horner 0.034661054611206055
naive 0.12771987915039062
itera 0.07331395149230957
Whoa! All the times just got faster (by 10-60x). Proportionally, the Horner implementation sped up even more than the other two. We removed the overhead on all three, and can now see the "bare bones" difference.
Horner is 4x faster than naive and 2x faster than itera.
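To see the per-element overhead concretely, here is a small sketch (assuming NumPy is installed; the variable names are illustrative) that inspects what indexing the (500, 1) array returns and shows one way to turn it into plain Python floats.

```python
import numpy as np

c_arr = np.random.uniform(size=(500, 1))

elem = c_arr[0]
# Indexing the first axis of a 2-D array yields a 1-D array, not a float.
print(type(elem), elem.shape)

# Flatten to one dimension and convert to a list of plain Python floats.
c_list = c_arr.ravel().tolist()
print(type(c_list[0]), len(c_list))
```

With c_list, every arithmetic step in the three functions works on ordinary floats instead of one-element arrays.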
(3) Alternate runtimes
You're using Python 2. I assume 2.7.
Let's see how Python 3.4 fares. (Syntax adjustment: you'll need to put parentheses around the argument list to print.)
Horner 0.03298933599944576
naive 0.13706714100044337
itera 0.06771054599812487
About the same.
Let's try PyPy, a JIT implementation of Python. (The "normal" Python implementation is called CPython.)
Horner 0.006507158279418945
naive 0.07541298866271973
itera 0.005059003829956055
Nice! Each implementation is now running 2-5x faster. Horner is now 10x the speed of naive, but slightly slower than itera.
JIT runtimes are more difficult to profile than interpreters. Let's increase the number of iterations to 50000, and try it just to make sure.
Horner 0.12749004364013672
naive 3.2823100090026855
itera 0.06546688079833984
(Note that we have 50x the iterations, but only 20x the time...the JIT hadn't taken full effect for many of the first 1000 runs.) Same conclusions, but the differences are even more pronounced.
Granted, the idea of JIT is to profile, analyze, and rewrite the program at runtime, so if your goal is to compare algorithms, this is going to add a lot of non-obvious implementation detail.
Nonetheless, comparing runtimes can be useful in giving a broader perspective.
There are a few more things. For example, your naive implementation computes a variable it never uses. You use range instead of xrange. You could try iterating backwards with an index rather than a reverse slice. Etc.
None of these changed the results much for me, but they were worth considering.
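For reference, the small tweaks listed above might look like the following sketch (Python 3; the function names are my own, and none of this is claimed to be faster on any particular machine).

```python
def horner_reversed(c, x):
    """Horner's rule using reversed(), which iterates without copying c."""
    p = 0.0
    for coeff in reversed(c):
        p = p * x + coeff
    return p

def horner_indexed(c, x):
    """Horner's rule iterating backwards with an index."""
    p = 0.0
    for i in range(len(c) - 1, -1, -1):
        p = p * x + c[i]
    return p

def naive_clean(c, x):
    """Naive evaluation without the unused length variable."""
    p = 0.0
    for i, coeff in enumerate(c):
        p += coeff * x**i
    return p

# 1 + 2*x + 3*x**2 at x = 2
print(horner_reversed([1, 2, 3], 2), horner_indexed([1, 2, 3], 2), naive_clean([1, 2, 3], 2))
```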
You cannot obtain accurate results by measuring like this:
start_time=time.time()
print Horner(c,x)
print time.time()-start_time
Presumably most of the time is spent in the I/O performed by print. In addition, to get a significant result, you should take the measurement over a large number of iterations in order to smooth out errors. In the general case, you might want to run your test on various input data as well, since, depending on your algorithm, some cases might coincidentally be solved more efficiently than others.
You should definitely take a look at the timeit module. Something like this, maybe:
import timeit
print 'Horner',timeit.timeit(stmt='Horner(c,x)',
                             setup='from __main__ import Horner, c, x',
                             number = 10000)
#                                     ^^^^^
#                  probably not enough. Increase that once you
#                  are confident
print 'naive',timeit.timeit(stmt='naive(c,x)',
                            setup='from __main__ import naive, c, x',
                            number = 10000)
print 'itera',timeit.timeit(stmt='itera(c,x)',
                            setup='from __main__ import itera, c, x',
                            number = 10000)
Producing this on my system:
Horner 23.3317809105
naive 28.305519104
itera 24.385917902
But the results still vary from one run to the next:
Horner 21.1151690483
naive 23.4374330044
itera 21.305426836
As I said before, to obtain more meaningful results, you should definitely increase the number of iterations, and run the test on several test cases in order to smooth the results.
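A minimal sketch of both points, keeping print out of the timed region and trying several inputs, could look like this. It uses Python 3 and time.perf_counter; the helper time_call and the chosen sizes are assumptions for illustration only.

```python
import random
import time

def horner(c, x):
    p = 0.0
    for coeff in reversed(c):
        p = p * x + coeff
    return p

def time_call(func, c, x, number=1000):
    """Return (last result, average seconds per call) over `number` calls."""
    start = time.perf_counter()
    for _ in range(number):
        result = func(c, x)
    elapsed = time.perf_counter() - start
    return result, elapsed / number

random.seed(0)
for n in (10, 100, 500):          # several input sizes
    c = [random.random() for _ in range(n)]
    result, per_call = time_call(horner, c, -1.34)
    # Printing happens only after the clock has been stopped.
    print(n, result, per_call)
```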
If you are doing a lot of benchmarking, scientific computing, or NumPy-related work, IPython will be an extremely useful tool.
To benchmark, you can time code with the timeit IPython magic, which gives more consistent results from run to run. It is simply a matter of writing timeit followed by the function call or code to time:
In [28]: timeit Horner(c,x)
1000 loops, best of 3: 670 µs per loop
In [29]: timeit naive(c,x)
1000 loops, best of 3: 983 µs per loop
In [30]: timeit itera(c,x)
1000 loops, best of 3: 804 µs per loop
To time code spanning more than one line you simply use %%timeit:
In [35]: %%timeit
   ....: for i in range(100):
   ....:     i ** i
   ....:
10000 loops, best of 3: 110 µs per loop
ipython can compile cython code, f2py code and do numerous other very helpful tasks using different plugins and ipython magic commands.
Using cython and some very basic improvements we can improve the efficiency of Horner by about 25 percent:
In [166]: %%cython
import numpy as np
cimport numpy as np
cimport cython
ctypedef np.float_t DTYPE_t
def C_Horner(c, DTYPE_t x):
    # a C variable is not zero-initialized automatically, so start the accumulator at 0
    cdef DTYPE_t p = 0
    for i in reversed(c):
        p = p * x + i
    return p
In [28]: c=npr.uniform(size=(2000,1))
In [29]: timeit Horner(c,-1.34)
100 loops, best of 3: 3.93 ms per loop
In [30]: timeit C_Horner(c,-1.34)
100 loops, best of 3: 2.21 ms per loop
In [31]: timeit itera(c,x)
100 loops, best of 3: 4.10 ms per loop
In [32]: timeit naive(c,x)
100 loops, best of 3: 4.95 ms per loop
Using the list from @Paul Draper's answer, our cythonised version runs about twice as fast as the original function and much faster than itera and naive:
In [214]: import random
In [215]: c = [random.random() for _ in range(500)]
In [44]: timeit C_Horner(c, -1.34)
10000 loops, best of 3: 18.9 µs per loop
In [45]: timeit Horner(c, -1.34)
10000 loops, best of 3: 44.6 µs per loop
In [46]: timeit naive(c, -1.34)
10000 loops, best of 3: 167 µs per loop
In [47]: timeit itera(c,-1.34)
10000 loops, best of 3: 75.8 µs per loop
I have some simple code that:
Reads a trajectory file, which can be seen as a list of 2D arrays (lists of positions in space), stored in Y
Then computes the RMSD for each pair (scipy pdist style)
My code works fine:
trajectory = read("test.lammpstrj", index="::")
m = len(trajectory)
#.get_positions() return a 2d numpy array
Y = np.array([snapshot.get_positions() for snapshot in trajectory])
b = [np.sqrt(((((Y[i]- Y[j])**2))*3).mean()) for i in range(m) for j in range(i + 1, m)]
This code executes in 0.86 seconds using Python 3.10; in Julia 1.8 the same kind of code executes in 0.46 seconds.
I plan to work with much larger trajectories (~200,000 elements). Would it be possible to get a speed-up using Python, or should I stick to Julia?
You've mentioned that snapshot.get_positions() returns some 2D array, suppose of shape (p, q). So I expect that Y is a 3D array with some shape (m, p, q), where m is the number of snapshots in the trajectory. You also expect m to scale rather high.
Let's see a basic way to speed up the distance calculation, on the setting m=1000:
import numpy as np

# dummy inputs
m = 1000
p, q = 4, 5
Y = np.random.randn(m, p, q)

# your current method
def foo():
    return [np.sqrt(((((Y[i]- Y[j])**2))*3).mean()) for i in range(m) for j in range(i + 1, m)]

# vectorized approach -> compute the upper triangle of the pairwise distance matrix
def bar():
    u, v = np.triu_indices(Y.shape[0], 1)
    return np.sqrt((3 * (Y[u] - Y[v]) ** 2).mean(axis=(-1, -2)))

# Check for correctness
out_1 = foo()
out_2 = bar()
print(np.allclose(out_1, out_2))
# True
If we test the time required:
%timeit -n 10 -r 3 foo()
# 3.16 s ± 50.3 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
The first method is really slow: it takes over 3 seconds for this calculation. Let's check the second method:
%timeit -n 10 -r 3 bar()
# 97.5 ms ± 405 µs per loop (mean ± std. dev. of 3 runs, 10 loops each)
So we have a ~30x speedup here, which would make your large calculation in python much more feasible than using the original code. Feel free to test out with other sizes of Y to see how it scales compared to the original.
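Before scaling up, it may help to estimate how much memory the pairwise result needs. The arithmetic below is a back-of-the-envelope sketch that assumes m = 200,000 snapshots of shape (4, 5) stored as float64; the real shapes and dtype of your trajectory may differ.

```python
m, p, q = 200_000, 4, 5   # assumed trajectory shape
itemsize = 8              # bytes per float64

n_pairs = m * (m - 1) // 2                 # entries in the strict upper triangle
output_bytes = n_pairs * itemsize          # the condensed distance vector itself
temp_bytes = n_pairs * p * q * itemsize    # one temporary the size of Y[u] - Y[v]

print("pairs:       ", n_pairs)
print("output (GB): ", output_bytes / 1e9)
print("one temp (GB):", temp_bytes / 1e9)
```

Under these assumptions the output alone is about 160 GB, and a single fully vectorized temporary is about 3.2 TB, so at that size the work has to be done in blocks (or the distances consumed as they are produced) regardless of the language.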
JIT
In addition, you can also try out JIT, mainly jax or numba. It is fairly simple to port the function bar with jax.numpy, for example:
import jax
import jax.numpy as jnp

@jax.jit
def jit_bar(Y):
    u, v = jnp.triu_indices(Y.shape[0], 1)
    return jnp.sqrt((3 * (Y[u] - Y[v]) ** 2).mean(axis=(-1, -2)))

# check for correctness
print(np.allclose(bar(), jit_bar(Y)))
# True
If we test the time of the jitted jnp op:
%timeit -n 10 -r 3 jit_bar(Y)
# 10.6 ms ± 678 µs per loop (mean ± std. dev. of 3 runs, 10 loops each)
So compared to the original, we reach a speedup of up to ~300x.
Note that not every operation can be converted to jax/jit so easily (this particular problem is conveniently suitable), so the general advice is to simply avoid python loops and use numpy's broadcasting/vectorization capabilities, like in bar().
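As one illustration of that advice, here is a sketch of a middle ground between a loop over all pairs and a fully vectorized call: it loops over rows in Python but uses broadcasting within each row, so the temporaries stay small. The function name rmsd_rowwise is hypothetical and the code has not been benchmarked against the versions above.

```python
import numpy as np

def rmsd_rowwise(Y):
    """Condensed pairwise RMSD, one broadcasted operation per snapshot."""
    m = Y.shape[0]
    flat = Y.reshape(m, -1)                 # (m, p*q) view of the positions
    out = np.empty(m * (m - 1) // 2)
    pos = 0
    for i in range(m - 1):
        diff = flat[i + 1:] - flat[i]       # broadcasts row i against all later rows
        k = m - i - 1
        out[pos:pos + k] = np.sqrt(3.0 * (diff ** 2).mean(axis=1))
        pos += k
    return out

# Small check against the pair-by-pair definition.
rng = np.random.default_rng(0)
Y_small = rng.standard_normal((6, 4, 5))
reference = [np.sqrt((((Y_small[i] - Y_small[j]) ** 2) * 3).mean())
             for i in range(6) for j in range(i + 1, 6)]
print(np.allclose(rmsd_rowwise(Y_small), reference))
```

The output order matches the (i, j) order of the original list comprehension, so the result can be compared element by element.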
Stick to Julia.
If you already made it in a language which runs faster, why are you trying to use python in the first place?
Your question is about speeding up Python, relative to Julia, so I'd like to offer some Julia code for comparison.
Since your data is most naturally expressed as a list of 4x5 arrays, I suggest expressing it as a vector of SMatrixes:
sumdiff2(A, B) = sum((A[i] - B[i])^2 for i in eachindex(A, B))

function dists(Y)
    M = length(Y)
    V = Vector{float(eltype(eltype(Y)))}(undef, sum(1:M-1))
    Threads.@threads for i in eachindex(Y)
        ii = sum(M-i+1:M-1)  # don't worry about this sum
        for j in i+1:lastindex(Y)
            ind = ii + (j-i)
            V[ind] = sqrt(3 * sumdiff2(Y[i], Y[j])/length(Y[i]))
        end
    end
    return V
end

using Random: randn
using StaticArrays: SMatrix
Ys = [randn(SMatrix{4,5,Float64}) for _ in 1:1000];
Benchmarks:
# single-threaded
julia> using BenchmarkTools
julia> @btime dists($Ys);
  6.561 ms (2 allocations: 3.81 MiB)
# multi-threaded with 6 cores
julia> @btime dists($Ys);
  1.606 ms (75 allocations: 3.82 MiB)
I was not able to install jax on my computer, but when comparing with @Mercury's numpy code I got
foo: 5.5 seconds
bar: 179 ms
i.e. the multi-threaded Julia version is approximately 3400x faster than foo.
It is possible to write this as a one-liner at a ~2-3x performance cost.
While Python tends to be slower than Julia for many tasks, it is possible to write numerical code in Python that is as fast as Julia by using Numba and plain loops. Numba is built on llvmlite, a lightweight binding to the LLVM toolchain, which it uses as a JIT compiler. The standard implementation of Julia also uses a JIT and the LLVM toolchain. This means the two should perform similarly, apart from the overhead introduced by each language, which is negligible once the computation is performed in parallel (because the resulting computation will be memory-bound on nearly all modern platforms).
This computation can be parallelized in both Julia and Python (still using Numba). While writing a sequential computation is quite straightforward, writing a parallel computation is a bit more complex. Computing the upper triangular values row by row results in an imbalanced workload and so in a sub-optimal execution time. An efficient strategy is to compute, for each iteration, a pair of lines: one from the top of the upper triangular part and one from the bottom. The top line (row i) contains m-i-1 items while the bottom one (row m-i-1) contains i items. In the end, there are m-1 items to compute per iteration, so the number of items is independent of the iteration number. This results in much better load balancing. When m is odd, the middle line needs to be computed separately.
Here is the final implementation:
import numba as nb
import numpy as np

@nb.njit(inline='always', fastmath=True)
def compute_line(tmp, res, i, m):
    n = tmp.shape[1]  # number of coordinates per snapshot (p * q)
    offset = (i * (2 * m - i - 1)) // 2  # start of row i in the condensed output
    factor = 3.0 / n
    for j in range(i + 1, m):
        s = 0.0
        for k in range(n):
            s += (tmp[i, k] - tmp[j, k]) ** 2
        res[offset] = np.sqrt(s * factor)
        offset += 1
    return res

@nb.njit('()', parallel=True, fastmath=True)
def fastest():
    m, n = Y.shape[0], Y.shape[1] * Y.shape[2]
    res = np.empty(m*(m-1)//2)
    tmp = Y.reshape(m, n)
    for i in nb.prange(m//2):
        compute_line(tmp, res, i, m)
        compute_line(tmp, res, m-i-1, m)
    if m % 2 == 1:
        # for odd m, the unpaired middle row has index m//2
        compute_line(tmp, res, m//2, m)
    return res

# [...] same as others
%timeit -n 100 fastest()
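The row pairing and the offset formula used by this implementation can be checked in plain Python. The sketch below uses hypothetical helper names (row_length, row_offset, paired_rows) and only counts work; it does not compute any distances.

```python
def row_length(i, m):
    """Number of pairs (i, j) with i < j < m."""
    return m - i - 1

def row_offset(i, m):
    """Index in the condensed output where row i starts."""
    return (i * (2 * m - i - 1)) // 2

def paired_rows(m):
    """Rows handled together in each parallel iteration, plus the leftover row."""
    pairs = [(i, m - i - 1) for i in range(m // 2)]
    middle = m // 2 if m % 2 == 1 else None
    return pairs, middle

m = 7
pairs, middle = paired_rows(m)
work = [row_length(a, m) + row_length(b, m) for a, b in pairs]
print(pairs, middle, work)
```

For m = 7 every iteration handles m - 1 = 6 pairs, and the unpaired row for odd m is row m // 2. Consecutive offsets differ by exactly the row length, so rows never overlap in the output.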
Results
Here are performance results on my machine (with an i5-9600KF having 6 cores):
foo (seq, Python, Mercury): 4910.7 ms
bar (seq, Python, Mercury): 134.2 ms
jit_bar (seq, Python, Mercury): ???
dists (seq, Julia, DNF) 6.9 ms
dists (par, Julia, DNF) 2.2 ms
fastest (par, Python, me): 1.5 ms <-----
(Jax does not work on my machine so I cannot test it yet)
This implementation is the fastest one and succeeds in beating the best Julia code so far.
Optimal implementation
Note that for large arrays like (200_000,4,5), all implementations provided so far are inefficient since they are not cache friendly. The input array takes about 32 MB and will not fit in the cache of most modern processors (and even if it could, one needs to consider the space needed for the output and the fact that caches are not perfect). This can be fixed using tiling, at the expense of even more complex code. I think such an implementation should be optimal if you use Z-order curves.
I need to improve the execution time of a script that I have been working on. I started working with the numba jit decorator to try parallel computing; however, it throws
KeyError: "Does not support option: 'parallel'"
So I decided to test nogil to see if it unlocks the full capabilities of my CPU, but it was slower than pure Python. I don't understand why this happened, and I would be very grateful if someone could help or guide me.
import numpy as np
from numba import *

@jit(['float64[:,:],float64[:,:]'],'(n,m),(n,m)->(n,m)',nogil=True)
def asd(x,y):
    return x+y

u=np.random.random(100)
w=np.random.random(100)
%timeit asd(u,w)
%timeit u+w
10000 loops, best of 3: 137 µs per loop
The slowest run took 7.13 times longer than the fastest. This could mean that an intermediate result is being cached
1000000 loops, best of 3: 1.75 µs per loop
You cannot expect numba to outperform numpy on such a simple vectorized operation. Also, your comparison isn't exactly fair, since the numba function includes the cost of the outside function call. If you sum a larger array, you'll see that the performance of the two converges and that what you are seeing is just overhead on a very fast operation:
import numpy as np
import numba as nb

@nb.njit
def asd(x,y):
    return x+y

def asd2(x, y):
    return x + y

u=np.random.random(10000)
w=np.random.random(10000)
%timeit asd(u,w)
%timeit asd2(u,w)
The slowest run took 17796.43 times longer than the fastest. This could mean
that an intermediate result is being cached.
100000 loops, best of 3: 6.06 µs per loop
The slowest run took 29.94 times longer than the fastest. This could mean that
an intermediate result is being cached.
100000 loops, best of 3: 5.11 µs per loop
As far as parallel functionality, for this simple operation, you can use nb.vectorize:
@nb.vectorize([nb.float64(nb.float64, nb.float64)], target='parallel')
def asd3(x, y):
    return x + y

u=np.random.random((100000, 10))
w=np.random.random((100000, 10))
%timeit asd(u,w)
%timeit asd2(u,w)
%timeit asd3(u,w)
But again, if you operate on small arrays, you are going to be seeing the overhead of thread dispatch. For the array sizes above, I see the parallel giving me a 2x speedup.
Where numba really shines is doing operations that are difficult to do in numpy using broadcasting, or when operations would result in a lot of temporary intermediate array allocations.
I expected array.array to be faster than lists, as arrays seem to be unboxed.
However, I get the following result:
In [1]: import array
In [2]: L = list(range(100000000))
In [3]: A = array.array('l', range(100000000))
In [4]: %timeit sum(L)
1 loop, best of 3: 667 ms per loop
In [5]: %timeit sum(A)
1 loop, best of 3: 1.41 s per loop
In [6]: %timeit sum(L)
1 loop, best of 3: 627 ms per loop
In [7]: %timeit sum(A)
1 loop, best of 3: 1.39 s per loop
What could be the cause of such a difference?
The storage is "unboxed", but every time you access an element Python has to "box" it (embed it in a regular Python object) in order to do anything with it. For example, your sum(A) iterates over the array, and boxes each integer, one at a time, in a regular Python int object. That costs time. In your sum(L), all the boxing was done at the time the list was created.
So, in the end, an array is generally slower, but requires substantially less memory.
Here's the relevant code from a recent version of Python 3, but the same basic ideas apply to all CPython implementations since Python was first released.
Here's the code to access a list item:
PyObject *
PyList_GetItem(PyObject *op, Py_ssize_t i)
{
    /* error checking omitted */
    return ((PyListObject *)op) -> ob_item[i];
}
There's very little to it: somelist[i] just returns the i'th object in the list (and all Python objects in CPython are pointers to a struct whose initial segment conforms to the layout of a struct PyObject).
And here's the __getitem__ implementation for an array with type code l:
static PyObject *
l_getitem(arrayobject *ap, Py_ssize_t i)
{
    return PyLong_FromLong(((long *)ap->ob_item)[i]);
}
The raw memory is treated as a vector of platform-native C long integers; the i'th C long is read; and then PyLong_FromLong() is called to wrap ("box") the native C long in a Python long object (which, in Python 3, which eliminates Python 2's distinction between int and long, is actually shown as type int).
This boxing has to allocate new memory for a Python int object, and spray the native C long's bits into it. In the context of the original example, this object's lifetime is very brief (just long enough for sum() to add the contents into a running total), and then more time is required to deallocate the new int object.
This is where the speed difference comes from, always has come from, and always will come from in the CPython implementation.
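The boxing can be observed from Python itself. The sketch below relies on CPython behaviour (a fresh int object per array access, for values outside the small-integer cache) and is only meant as an illustration, not as something to depend on in real code.

```python
import array

big = 10**6                      # large enough to be outside CPython's small-int cache
L = [big]
A = array.array('l', [big])

# A list stores a pointer to an existing int object and hands back that same object.
list_same = L[0] is L[0]

# An array stores the raw C long and builds a new int object on every access.
array_same = A[0] is A[0]

print(list_same, array_same, L[0] == A[0])
```

On CPython the list access returns the identical object both times, while the two array accesses produce two distinct (but equal) int objects.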
To add to Tim Peters' excellent answer, arrays implement the buffer protocol, while lists do not. This means that, if you are writing a C extension (or the moral equivalent, such as writing a Cython module), then you can access and work with the elements of an array much faster than anything Python can do. This will give you considerable speed improvements, possibly well over an order of magnitude. However, it has a number of downsides:
You are now in the business of writing C instead of Python. Cython is one way to ameliorate this, but it does not eliminate many fundamental differences between the languages; you need to be familiar with C semantics and understand what it is doing.
PyPy's C API works to some extent, but isn't very fast. If you are targeting PyPy, you should probably just write simple code with regular lists, and then let the JITter optimize it for you.
C extensions are harder to distribute than pure Python code because they need to be compiled. Compilation tends to be architecture and operating-system dependent, so you will need to ensure you are compiling for your target platform.
Going straight to C extensions may be using a sledgehammer to swat a fly, depending on your use case. You should first investigate NumPy and see if it is powerful enough to do whatever math you're trying to do. It will also be much faster than native Python, if used correctly.
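The buffer protocol mentioned above can be seen from pure Python through memoryview, which exposes the same raw memory that a C extension, Cython, or NumPy would read. This is only a sketch of the mechanism; accessing elements through a memoryview from Python still boxes each value, so it is not itself a speed-up.

```python
import array

A = array.array('l', range(5))

view = memoryview(A)             # no copy: the view shares A's memory
print(view.format, view.itemsize, view.nbytes)

view[0] = 42                     # writing through the view changes A
first_three = view[:3].tolist()  # slicing a view does not copy until tolist()
print(A[0], first_three)

view.release()                   # release the buffer so A can be resized again
```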
Tim Peters answered why this is slow, but let's see how to improve it.
Sticking to your example of sum(range(...)) (factor 10 smaller than your example to fit into memory here):
import numpy
import array
L = list(range(10**7))
A = array.array('l', L)
N = numpy.array(L)
%timeit sum(L)
10 loops, best of 3: 101 ms per loop
%timeit sum(A)
1 loop, best of 3: 237 ms per loop
%timeit sum(N)
1 loop, best of 3: 743 ms per loop
Used this way, numpy also has to box/unbox each element, which adds overhead. To make it fast, one has to stay within numpy's C code:
%timeit N.sum()
100 loops, best of 3: 6.27 ms per loop
So from the list solution to the numpy version this is a factor 16 in runtime.
Let's also check how long creating those data structures takes
%timeit list(range(10**7))
1 loop, best of 3: 283 ms per loop
%timeit array.array('l', range(10**7))
1 loop, best of 3: 884 ms per loop
%timeit numpy.array(range(10**7))
1 loop, best of 3: 1.49 s per loop
%timeit numpy.arange(10**7)
10 loops, best of 3: 21.7 ms per loop
Clear winner: Numpy
Also note that creating the data structure takes about as much time as summing, if not more. Allocating memory is slow.
Memory usage of those:
sys.getsizeof(L)
90000112
sys.getsizeof(A)
81940352
sys.getsizeof(N)
80000096
So these take 8 bytes per number with varying overhead. For the range we use, 32-bit ints are sufficient, so we can save some memory.
N=numpy.arange(10**7, dtype=numpy.int32)
sys.getsizeof(N)
40000096
%timeit N.sum()
100 loops, best of 3: 8.35 ms per loop
But it turns out that adding 64bit ints is faster than 32bit ints on my machine, so this is only worth it if you are limited by memory/bandwidth.
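One caveat about the sizes above: sys.getsizeof on a list counts only the list object and its array of pointers, not the int objects those pointers refer to. A small sketch (with an arbitrary n and hypothetical variable names) that adds the elements back in:

```python
import array
import sys

n = 1000
values = list(range(1000, 1000 + n))   # ints outside CPython's small-int cache
arr = array.array('q', values)         # 'q' is always an 8-byte signed integer

list_shallow = sys.getsizeof(values)                              # pointers only
list_deep = list_shallow + sum(sys.getsizeof(v) for v in values)  # plus the int objects
array_payload = len(arr) * arr.itemsize                           # raw element bytes

print(list_shallow, list_deep, array_payload, sys.getsizeof(arr))
```

Counted this way, the list needs several times the memory of the array or a NumPy array for the same numbers.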
I noticed that typecode L is faster than l, and the same holds for I and Q compared with i and q.
Python 3.8.5
Here is the code of the test.
Check it out d_d.
#!/usr/bin/python3
import inspect
import time  # needed for time.time() in performtest
from tqdm import tqdm
from array import array


def get_var_name(var):
    """
    Gets the name of var. Does it from the out most frame inner-wards.
    :param var: variable to get name from.
    :return: string
    """
    for fi in reversed(inspect.stack()):
        names = [var_name for var_name, var_val in fi.frame.f_locals.items() if var_val is var]
        if len(names) > 0:
            return names[0]


def performtest(func, n, *args, **kwargs):
    times = array('f')
    times_append = times.append
    for i in tqdm(range(n)):
        st = time.time()
        func(*args, **kwargs)
        times_append(time.time() - st)
    print(
        f"Func {func.__name__} with {[get_var_name(i) for i in args]} run {n} rounds consuming |"
        f" Mean: {sum(times)/len(times)}s | Max: {max(times)}s | Min: {min(times)}s"
    )


def list_int(start, end, step=1):
    return [i for i in range(start, end, step)]


def list_float(start, end, step=1):
    return [i + 1e-1 for i in range(start, end, step)]


def array_int(start, end, step=1):
    return array("I", range(start, end, step))  # speed I > i, H > h, Q > q, I~=H~=Q


def array_float(start, end, step=1):
    return array("f", [i + 1e-1 for i in range(start, end, step)])  # speed f > d


if __name__ == "__main__":
    performtest(list_int, 1000, 0, 10000)
    performtest(array_int, 1000, 0, 10000)
    performtest(list_float, 1000, 0, 10000)
    performtest(array_float, 1000, 0, 10000)
Results
[Image: result of the test]
Please note that 100000000 equals 10^8, not 10^7. My results are as follows:
100000000 == 10**8
# my test results on a Linux virtual machine:
#<L = list(range(100000000))> Time: 0:00:03.263585
#<A = array.array('l', range(100000000))> Time: 0:00:16.728709
#<L = list(range(10**8))> Time: 0:00:03.119379
#<A = array.array('l', range(10**8))> Time: 0:00:18.042187
#<A = array.array('l', L)> Time: 0:00:07.524478
#<sum(L)> Time: 0:00:01.640671
#<np.sum(L)> Time: 0:00:20.762153
I have a function in Python that basically takes the sign of an array of shape (75,150), for example.
I'm coming from Matlab, and the execution time looks more or less the same except for this function.
I'm wondering whether sign() is very slow and whether you know an alternative that does the same.
Thx,
I can't tell you if this is faster or slower than Matlab, since I have no idea what numbers you're seeing there (you provided no quantitative data at all). However, as far as alternatives go:
import numpy as np
a = np.random.randn(75, 150)
aSign = np.sign(a)
Testing using %timeit in IPython:
In [15]: %timeit np.sign(a)
10000 loops, best of 3: 180 µs per loop
Because the loop over the array (and what happens inside it) is implemented in optimized C code rather than generic Python code, it tends to be about an order of magnitude faster—in the same ballpark as Matlab.
Comparing the exact same code as a numpy vectorized operation vs. a Python loop:
In [276]: %timeit [np.sign(x) for x in a]
1000 loops, best of 3: 276 us per loop
In [277]: %timeit np.sign(a)
10000 loops, best of 3: 63.1 us per loop
So, only 4x as fast here. (But then a is pretty small here.)
I need to get only the fractional part of an array.
using numpy or simply python modf function is convenient.
In case we have big arrays of positive fractional data, which can be as big as (1000000,3) for instance, which is better to use:
1. numpy.modf(array)[0]
2. array-numpy.trunc(array)
In my opinion 2 is faster and cheaper in memory usage ... but not sure. What do python and numpy experts think ?
I'm not an expert, so I have to use the timeit module to check speed. I use IPython (which makes timing things really easy) but even without it the timeit module is probably the way to go.
In [21]: a = numpy.random.random((10**6, 3))
In [22]: timeit numpy.modf(a)[0]
10 loops, best of 3: 90.1 ms per loop
In [23]: timeit a-numpy.trunc(a)
10 loops, best of 3: 135 ms per loop
In [24]: timeit numpy.mod(a, 1.0)
10 loops, best of 3: 68.3 ms per loop
In [25]: timeit a % 1.0
10 loops, best of 3: 68.1 ms per loop
The last two are equivalent. I don't know much about memory use, but I'd be surprised if modf(a)[0] and a-numpy.trunc(a) both didn't use more memory than simply taking the mod directly.
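On the memory question, NumPy ufuncs accept an out argument, so the mod can be written into an existing array instead of allocating a new one. The sketch below assumes positive data, as in the question, and that overwriting the array is acceptable; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((1000, 3)) * 100.0      # positive values with fractional parts

frac_modf = np.modf(a)[0]              # allocates fractional and integral arrays
frac_mod = np.mod(a, 1.0)              # allocates one result array

work = a.copy()
np.mod(work, 1.0, out=work)            # reuses work's memory, no new result array

print(np.allclose(frac_modf, frac_mod), np.array_equal(work, frac_mod))
```

Note that for negative inputs the two approaches differ: np.mod(-1.5, 1.0) is 0.5, while np.modf(-1.5)[0] is -0.5.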
[BTW, if your code does what you want it to and you're only interested in improvements, you might be interested in the codereview stackexchange. I still don't have a good handle on where the dividing line is, but this feels a little more like their cup of tea.]