Sorry for so many questions. I am running Mac OS X 10.6 on an Intel Core 2 Duo. I am running some benchmarks for my research and I have run into another thing that baffles me.
If I run
python -mtimeit -s 'import numpy as np; a = np.random.randn(1e3,1e3)' 'np.dot(a,a)'
I get the following output: 10 loops, best of 3: 142 msec per loop
However, if I run
python -mtimeit -s 'import numpy as np; a = np.random.randint(10,size=1e6).reshape(1e3,1e3)' 'np.dot(a,a)'
I get the following output: 10 loops, best of 3: 7.57 sec per loop
Then I ran
python -mtimeit -s 'import numpy as np; a = np.random.randn(1e3,1e3)' 'a*a'
and then
python -mtimeit -s 'import numpy as np; a = np.random.randint(10,size=1e6).reshape(1e3,1e3)' 'a*a'
Both ran at about 7.6 msec per loop, so it is not the multiplication. Addition showed similar speeds as well, so neither of these should be affecting the dot product, right?
So why is it over 50 times slower to calculate the dot product using ints than using floats?
Very interesting. I was curious to see how it was implemented, so I did:
>>> import inspect
>>> import numpy as np
>>> inspect.getmodule(np.dot)
<module 'numpy.core._dotblas' from '/Library/Python/2.6/site-packages/numpy-1.6.1-py2.6-macosx-10.6-universal.egg/numpy/core/_dotblas.so'>
>>>
So it looks like it's using the BLAS library.
so:
>>> help(np.core._dotblas)
from which I found this:
When Numpy is built with an accelerated BLAS like ATLAS, these functions
are replaced to make use of the faster implementations. The faster
implementations only affect float32, float64, complex64, and complex128
arrays. Furthermore, the BLAS API only includes matrix-matrix,
matrix-vector, and vector-vector products. Products of arrays with larger
dimensionalities use the built in functions and are not accelerated.
So it looks like ATLAS fine-tunes certain functions, but it's only applicable to certain data types. Very interesting.
So yeah, it looks like I'll be using floats more often ...
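A quick way to see which dtypes hit the accelerated path is to time np.dot over a few candidate dtypes. This is a minimal sketch (the matrix size and dtype list are just illustrative):

import numpy as np
import timeit

# Time a square matrix product for several dtypes; the float types
# should be dramatically faster if an accelerated BLAS is linked in.
for dtype in (np.float32, np.float64, np.int32, np.int64):
    a = np.ones((500, 500), dtype=dtype)
    t = timeit.timeit(lambda: np.dot(a, a), number=3)
    print('%s: %.4f s per dot' % (dtype.__name__, t / 3))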
Using int vs float data types causes different code paths to be executed:
The stack trace for float looks like this:
(gdb) backtr
#0 0x007865a0 in dgemm_ () from /usr/lib/libblas.so.3gf
#1 0x007559d5 in cblas_dgemm () from /usr/lib/libblas.so.3gf
#2 0x00744108 in dotblas_matrixproduct (__NPY_UNUSED_TAGGEDdummy=0x0, args=(<numpy.ndarray at remote 0x85d9090>, <numpy.ndarray at remote 0x85d9090>),
kwargs=0x0) at numpy/core/blasdot/_dotblas.c:798
#3 0x08088ba1 in PyEval_EvalFrameEx ()
...
...while the stack trace for int looks like this:
(gdb) backtr
#0 LONG_dot (ip1=0xb700a280 "\t", is1=4, ip2=0xb737dc64 "\a", is2=4000, op=0xb6496fc4 "", n=1000, __NPY_UNUSED_TAGGEDignore=0x85fa960)
at numpy/core/src/multiarray/arraytypes.c.src:3076
#1 0x00659d9d in PyArray_MatrixProduct2 (op1=<numpy.ndarray at remote 0x85dd628>, op2=<numpy.ndarray at remote 0x85dd628>, out=0x0)
at numpy/core/src/multiarray/multiarraymodule.c:847
#2 0x00742b93 in dotblas_matrixproduct (__NPY_UNUSED_TAGGEDdummy=0x0, args=(<numpy.ndarray at remote 0x85dd628>, <numpy.ndarray at remote 0x85dd628>),
kwargs=0x0) at numpy/core/blasdot/_dotblas.c:254
#3 0x08088ba1 in PyEval_EvalFrameEx ()
...
Both calls lead to dotblas_matrixproduct, but it appears that the float call stays in the BLAS library (probably accessing some well-optimized code), while the int call gets kicked back out to numpy's PyArray_MatrixProduct2.
So either this is a bug, or BLAS simply doesn't support integer types in its matrix product. The latter is in fact the case: the standard BLAS interface only defines matrix products (GEMM) for single- and double-precision real and complex values, which matches the float32/float64/complex64/complex128 list quoted above, so integer arrays have no accelerated path to take.
Here's an easy and inexpensive workaround:
af = a.astype(float)          # cast once so the fast BLAS dgemm path is taken
np.dot(af, af).astype(int)    # cast the result back to int
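As a quick sanity check, here is a hedged timing of the workaround against the plain integer dot (the array size mirrors the question; the back-cast is exact as long as the products fit in float64's 53-bit integer range):

import timeit

setup = 'import numpy as np; a = np.random.randint(10, size=(1000, 1000))'
# Plain integer dot: falls back to numpy's generic loop.
t_int = timeit.timeit('np.dot(a, a)', setup=setup, number=3)
# Cast-to-float workaround: goes through BLAS dgemm.
t_float = timeit.timeit('af = a.astype(float); np.dot(af, af).astype(int)',
                        setup=setup, number=3)
print('int dot:    %.3f s per call' % (t_int / 3))
print('float path: %.3f s per call' % (t_float / 3))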
This is my code:
from sortedcontainers import SortedList, SortedSet, SortedDict
import timeit
import random

def test_speed1(data):
    SortedList(data)

def test_speed2(data):
    sorted_data = SortedList()
    for val in data:
        sorted_data.add(val)

data = []
numpts = 10 ** 5
for i in range(numpts):
    data.append(random.random())
print(f'Num of pts:{len(data)}')

sorted_data = SortedList()
n_runs = 10
result = timeit.timeit(stmt='test_speed1(data)', globals=globals(), number=n_runs)
print(f'Speed1 is {1000*result/n_runs:0.0f}ms')

n_runs = 10
result = timeit.timeit(stmt='test_speed2(data)', globals=globals(), number=n_runs)
print(f'Speed2 is {1000*result/n_runs:0.0f}ms')
test_speed2 is supposed to take roughly 12 ms (I checked the setup they report). Why does it take 123 ms, about 10x slower?
test_speed1 runs in 15 ms (which makes sense).
I am running in a Conda environment.
This is where they outlined the performance:
https://grantjenks.com/docs/sortedcontainers/performance.html
You are presumably not executing your benchmark in the same conditions as they do:
you are not using the same benchmark code,
you don't use the same computer with the same performance characteristics,
you are not using the same Python version and environment,
you are not running the same OS,
etc.
Hence, the benchmark results are not comparable and you cannot conclude anything about the performance (and certainly not that "sortedcontainers is too slow").
Performance is only meaningful relative to a given execution context, and they only stated that their solution is faster relative to competing solutions.
If you really wish to execute the benchmark on your computer, follow the instructions they give in the documentation.
"init() uses Python’s highly optimized sorted() function while add() cannot.". This is why the speed2 is faster than the speed3.
This is the answer I got from the developers on the sortedcontainers library.
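A rough illustration of their point: bulk construction costs about the same as one sorted() call, while the add() loop pays per-element overhead. This is a sketch; the numbers will vary by machine:

from sortedcontainers import SortedList
import random
import timeit

data = [random.random() for _ in range(10 ** 5)]

def init_all():
    SortedList(data)          # sorts once via the C-implemented sorted()

def add_all():
    sl = SortedList()
    for val in data:
        sl.add(val)           # per-element Python-level bookkeeping

for fn in (init_all, add_all):
    t = timeit.timeit(fn, number=10)
    print('%s: %.0f ms per run' % (fn.__name__, 1000 * t / 10))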
I've found that max is slower than the sort function in Python 2 and 3.
Python 2
$ python -m timeit -s 'import random;a=range(10000);random.shuffle(a)' 'a.sort();a[-1]'
1000 loops, best of 3: 239 usec per loop
$ python -m timeit -s 'import random;a=range(10000);random.shuffle(a)' 'max(a)'
1000 loops, best of 3: 342 usec per loop
Python 3
$ python3 -m timeit -s 'import random;a=list(range(10000));random.shuffle(a)' 'a.sort();a[-1]'
1000 loops, best of 3: 252 usec per loop
$ python3 -m timeit -s 'import random;a=list(range(10000));random.shuffle(a)' 'max(a)'
1000 loops, best of 3: 371 usec per loop
Why is max() (O(n)) slower than the sort method (O(n log n))?
You have to be very careful when using the timeit module in Python.
python -m timeit -s 'import random;a=range(10000);random.shuffle(a)' 'a.sort();a[-1]'
Here the initialisation code runs once to produce a randomised array a. Then the rest of the code is run several times. The first time it sorts the array, but every other time you are calling the sort method on an already sorted array. Only the fastest time is returned, so you are actually timing how long it takes Python to sort an already sorted array.
Part of Python's sort algorithm is to detect when the array is already partly or completely sorted. When completely sorted it simply has to scan once through the array to detect this and then it stops.
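You can see the already-sorted fast path directly by timing a fresh copy each round, so one run can never pre-sort the input for the next. A minimal sketch:

import random
import timeit

shuffled = list(range(10000))
random.shuffle(shuffled)
presorted = sorted(shuffled)

# a[:] hands .sort() a fresh copy each run, so every run sees the same input.
t_shuffled = timeit.timeit(lambda: shuffled[:].sort(), number=1000)
t_sorted = timeit.timeit(lambda: presorted[:].sort(), number=1000)
print('shuffled input:       %.3f s' % t_shuffled)
print('already-sorted input: %.3f s' % t_sorted)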
If instead you tried:
python -m timeit -s 'import random;a=range(100000);random.shuffle(a)' 'sorted(a)[-1]'
then the sort happens on every timing loop and you can see that the time for sorting an array is indeed much longer than to just find the maximum value.
Edit: @skyking's answer explains the part I left unexplained: a.sort() knows it is working on a list, so it can directly access the elements. max(a) works on any arbitrary iterable, so it has to use generic iteration.
First off, note that max() uses the iterator protocol, while list.sort() uses ad hoc code. Clearly, using an iterator is an important source of overhead, which is why you are observing that difference in timings.
However, apart from that, your tests are not fair. You are running a.sort() on the same list more than once. The algorithm used by Python is specifically designed to be fast for already (partially) sorted data. Your tests are saying that the algorithm is doing its job well.
These are fair tests:
$ python3 -m timeit -s 'import random;a=list(range(10000));random.shuffle(a)' 'max(a[:])'
1000 loops, best of 3: 227 usec per loop
$ python3 -m timeit -s 'import random;a=list(range(10000));random.shuffle(a)' 'a[:].sort()'
100 loops, best of 3: 2.28 msec per loop
Here I'm creating a copy of the list every time. As you can see, the orders of magnitude of the results are different: microseconds vs. milliseconds, as we would expect.
And remember: big-Oh specifies an upper bound! The lower bound for Python's sorting algorithm is Ω(n). Being O(n log n) does not automatically imply that every run takes a time proportional to n log n. It does not even imply that it needs to be slower than a O(n) algorithm, but that's another story. What's important to understand is that in some favorable cases, an O(n log n) algorithm may run in O(n) time or less.
This could be because l.sort is a method of list while max is a generic function. This means that l.sort can rely on the internal representation of the list, while max has to go through the generic iterator protocol.
As a result, each element fetch is faster for l.sort than for max.
I assume that if you instead use sorted(a), the result will be slower than max(a).
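That guess is easy to check with a timing that forces the full sort on every run (a sketch):

import random
import timeit

a = list(range(10000))
random.shuffle(a)

# sorted() builds a new list each call, so no run benefits from a previous sort.
t_sorted = timeit.timeit(lambda: sorted(a)[-1], number=1000)
t_max = timeit.timeit(lambda: max(a), number=1000)
print('sorted(a)[-1]: %.3f s' % t_sorted)
print('max(a):        %.3f s' % t_max)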
I am using numpy and my model involves intensive matrix-matrix multiplication.
To speed things up, I use the multithreaded OpenBLAS library to parallelize the numpy.dot function.
My setting is as follows,
OS: CentOS 6.2 server, 12 CPUs, 96 GB RAM
Python version: Python 2.7.6
numpy: numpy 1.8.0
BLAS: OpenBLAS + IntelMKL
$ OMP_NUM_THREADS=8 python test_mul.py
Here is the code, which I took from https://gist.github.com/osdf/
test_mul.py:
import numpy
import sys
import timeit
try:
    import numpy.core._dotblas
    print 'FAST BLAS'
except ImportError:
    print 'slow blas'

print "version:", numpy.__version__
print "maxint:", sys.maxint
print
x = numpy.random.random((1000,1000))
setup = "import numpy; x = numpy.random.random((1000,1000))"
count = 5
t = timeit.Timer("numpy.dot(x, x.T)", setup=setup)
print "dot:", t.timeit(count)/count, "sec"
When I use OMP_NUM_THREADS=1 python test_mul.py, the result is
dot: 0.200172233582 sec
OMP_NUM_THREADS=2
dot: 0.103047609329 sec
OMP_NUM_THREADS=4
dot: 0.0533880233765 sec
Things go well.
However, when I set OMP_NUM_THREADS=8, the code starts to work only occasionally: sometimes it runs fine, and sometimes it does not even run and gives me core dumps.
When OMP_NUM_THREADS > 10, the code seems to break all the time.
I am wondering what is happening here. Is there something like a maximum number of threads that each process can use? Can I raise that limit, given that I have 12 CPUs in my machine?
Thanks
Firstly, I don't really understand what you mean by 'OpenBLAS + IntelMKL'. Both of those are BLAS libraries, and numpy should only link to one of them at runtime. You should probably check which of these two numpy is actually using. You can do this by calling:
$ ldd <path-to-site-packages>/numpy/core/_dotblas.so
Update: numpy/core/_dotblas.so was removed in numpy v1.10, but you can check the linkage of numpy/core/multiarray.so instead.
For example, I link against OpenBLAS:
...
libopenblas.so.0 => /opt/OpenBLAS/lib/libopenblas.so.0 (0x00007f788c934000)
...
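Alternatively, on reasonably recent NumPy versions you can ask NumPy itself which BLAS it was built against, which avoids hunting for the right shared object:

import numpy as np

# Prints the BLAS/LAPACK build configuration, including library paths.
np.show_config()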
If you are indeed linking against OpenBLAS, did you build it from source? If you did, you should see that in the Makefile.rule there is a commented option:
...
# You can define maximum number of threads. Basically it should be
# less than actual number of cores. If you don't specify one, it's
# automatically detected by the script.
# NUM_THREADS = 24
...
By default OpenBLAS will try to set the maximum number of threads to use automatically, but you could try uncommenting and editing this line yourself if it is not detecting this correctly.
Also, bear in mind that you will probably see diminishing returns in terms of performance from using more threads. Unless your arrays are very large it is unlikely that using more than 6 threads will give much of a performance boost because of the increased overhead involved in thread creation and management.
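If rebuilding OpenBLAS is not an option, you can also cap the thread count from the environment before NumPy first loads the BLAS library. A minimal sketch (OPENBLAS_NUM_THREADS and OMP_NUM_THREADS are the standard OpenBLAS/OpenMP variables):

import os

# These must be set before numpy (and thus the BLAS) is first imported.
os.environ['OPENBLAS_NUM_THREADS'] = '4'
os.environ['OMP_NUM_THREADS'] = '4'

import numpy
x = numpy.random.random((1000, 1000))
numpy.dot(x, x.T)  # now runs on at most 4 BLAS threads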
What's the latest and greatest for fast YAML parsing in Python? Syck is out of date and recommends using PyYaml, yet PyYaml is pretty slow, and suffers from the GIL problem:
>>> import time
>>> import yaml
>>> def xit(f, x):
...     import threading
...     for i in xrange(x):
...         threading.Thread(target=f).start()
...
>>> def stressit():
...     start = time.time()
...     res = yaml.load(open(path_to_11000_byte_yaml_file))
...     print "Took %.2fs" % (time.time() - start,)
...
>>> xit(stressit, 1)
Took 0.37s
>>> xit(stressit, 2)
Took 1.40s
Took 1.41s
>>> xit(stressit, 4)
Took 2.98s
Took 2.98s
Took 2.99s
Took 3.00s
Given my use case I can cache the parsed objects, but I'd still prefer a faster solution even for that.
The linked wiki page states after the warning: "Use libyaml (c), and PyYaml (python)", although the note does have a bad wikilink (it should be PyYAML, not PyYaml).
As for performance, depending on how you installed PyYAML you should have the CParser class available which implements a YAML parser written in optimized C. While I don't think this gets around the GIL issue, it is markedly faster. Here are a few cursory benchmarks I ran on my machine (AMD Athlon II X4 640, 3.0GHz, 8GB RAM):
First with the default pure-Python parser:
$ /usr/bin/python2 -m timeit -s 'import yaml; y=file("large.yaml", "r").read()' \
'yaml.load(y)'
10 loops, best of 3: 405 msec per loop
With the CParser:
$ /usr/bin/python2 -m timeit -s 'import yaml; y=file("large.yaml", "r").read()' \
'yaml.load(y, Loader=yaml.CLoader)'
10 loops, best of 3: 59.2 msec per loop
And, for comparison, with PyPy using the pure-Python parser:
$ pypy -m timeit -s 'import yaml; y=file("large.yaml", "r").read()' \
'yaml.load(y)'
10 loops, best of 3: 101 msec per loop
For large.yaml I just googled for "large yaml file" and came across this:
https://gist.github.com/nrh/667383/raw/1b3ba75c939f2886f63291528df89418621548fd/large.yaml
(I had to remove the first couple of lines to make it a single-doc YAML file otherwise yaml.load complains.)
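When relying on the C parser, the usual defensive pattern is to fall back to the pure-Python loader when the extension isn't compiled in (this is the pattern PyYAML's own documentation suggests):

import yaml

try:
    from yaml import CLoader as Loader
except ImportError:
    from yaml import Loader

with open('large.yaml') as f:
    data = yaml.load(f, Loader=Loader)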
EDIT:
Another thing to consider is using the multiprocessing module instead of threads. This gets around GIL problems, but does require a bit more boilerplate code to communicate between the processes. There are a number of good libraries available though to make multiprocessing easier. There's a pretty good list of them here.
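For example, a minimal multiprocessing sketch that parses several files in parallel worker processes (the file names are illustrative):

import multiprocessing
import yaml

def parse(path):
    # Use the C loader if it was compiled in, else the pure-Python one.
    with open(path) as f:
        return yaml.load(f, Loader=getattr(yaml, 'CLoader', yaml.Loader))

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=4)
    docs = pool.map(parse, ['a.yaml', 'b.yaml', 'c.yaml'])
    pool.close()
    pool.join()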
I have a function defined by a combination of basic math functions (abs, cosh, sinh, exp, ...).
I was wondering if it makes a difference (in speed) to use, for example,
numpy.abs() instead of abs()?
Here are the timing results:
lebigot@weinberg ~ % python -m timeit 'abs(3.15)'
10000000 loops, best of 3: 0.146 usec per loop
lebigot@weinberg ~ % python -m timeit -s 'from numpy import abs as nabs' 'nabs(3.15)'
100000 loops, best of 3: 3.92 usec per loop
numpy.abs() is slower than abs() because it also handles Numpy arrays: it contains additional code that provides this flexibility.
However, Numpy is fast on arrays:
lebigot@weinberg ~ % python -m timeit -s 'a = [3.15]*1000' '[abs(x) for x in a]'
10000 loops, best of 3: 186 usec per loop
lebigot@weinberg ~ % python -m timeit -s 'import numpy; a = numpy.empty(1000); a.fill(3.15)' 'numpy.abs(a)'
100000 loops, best of 3: 6.47 usec per loop
(PS: '[abs(x) for x in a]' is slower in Python 2.7 than the better map(abs, a), which is about 30% faster, but still much slower than NumPy.)
Thus, numpy.abs() does not take much more time for 1000 elements than for 1 single float!
You should use NumPy functions to deal with NumPy's types and regular Python functions to deal with regular Python types.
The worst performance usually occurs when mixing Python built-ins with NumPy, because of type conversions. Those type conversions have been optimized lately, but it's still often better to avoid them. Of course, your mileage may vary, so use profiling tools to find out.
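For instance, a hedged illustration of that conversion cost: calling np.abs() on a plain list forces an array conversion on every call, while an existing array pays nothing extra:

import timeit

setup = 'import numpy as np; L = [3.15] * 1000; a = np.array(L)'
# np.abs on a list must first build an ndarray from it on each call.
t_list = timeit.timeit('np.abs(L)', setup=setup, number=10000)
t_array = timeit.timeit('np.abs(a)', setup=setup, number=10000)
print('np.abs(list):  %.3f s' % t_list)
print('np.abs(array): %.3f s' % t_array)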
Also consider using tools like Cython, or writing a C module, if you want to optimize your program further. Or consider not using Python when performance matters.
But when your data has been put into a NumPy array, NumPy can be really fast at computing over whole batches of data.
In fact, on a NumPy array the built-in abs calls NumPy's implementation via __abs__; see Why built-in functions like abs works on numpy array?
So, in theory, there shouldn't be much of a performance difference.
import timeit
import numpy as np

x = np.random.standard_normal(10000)

def pure_abs():
    return abs(x)

def numpy_abs():
    return np.abs(x)

n = 10000
t1 = timeit.timeit(pure_abs, number=n)
print('Pure Python abs:', t1)
t2 = timeit.timeit(numpy_abs, number=n)
print('Numpy abs:', t2)
Pure Python abs: 0.435754060745
Numpy abs: 0.426516056061
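You can see the delegation directly: np.abs is just an alias for np.absolute, and the built-in abs dispatches to ndarray.__abs__:

import numpy as np

x = np.array([-1.0, 2.0, -3.0])
print(abs(x))                  # [1. 2. 3.] -- built-in abs calls x.__abs__()
print(x.__abs__())             # same result
print(np.abs is np.absolute)   # True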