Python sortedcontainers is too slow

This is my code:
from sortedcontainers import SortedList, SortedSet, SortedDict
import timeit
import random

def test_speed1(data):
    SortedList(data)

def test_speed2(data):
    sorted_data = SortedList()
    for val in data:
        sorted_data.add(val)

data = []
numpts = 10 ** 5
for i in range(numpts):
    data.append(random.random())
print(f'Num of pts: {len(data)}')

n_runs = 10
result = timeit.timeit(stmt='test_speed1(data)', globals=globals(), number=n_runs)
print(f'Speed1 is {1000*result/n_runs:0.0f} ms')

result = timeit.timeit(stmt='test_speed2(data)', globals=globals(), number=n_runs)
print(f'Speed2 is {1000*result/n_runs:0.0f} ms')
The test_speed2 code is supposed to take ~12 ms (I checked the setup they report). Why does it take 123 ms (10x slower)?
test_speed1 runs in 15 ms, which makes sense.
I am running in a Conda environment.
This is where they document the performance:
https://grantjenks.com/docs/sortedcontainers/performance.html

You are presumably not executing your benchmark under the same conditions as they do:
you are not using the same benchmark code,
you are not using the same computer with the same performance characteristics,
you are not using the same Python version and environment,
you are not running the same OS,
etc.
Hence, the benchmark results are not comparable, and you cannot conclude anything about the performance (and certainly not that "sortedcontainers is too slow").
Performance is only meaningful relative to a given execution context, and they only claim that their solution is faster than competing solutions.
If you really wish to execute the benchmark on your computer, follow the instructions they give in the documentation.
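
If you do want a more robust local measurement, take the minimum over several repeats rather than a single average; the minimum is much less sensitive to noise from other processes. A minimal sketch, reusing the functions and data already defined in the question:

import timeit
# min() of several repeats is less noisy than one averaged run
t1 = min(timeit.repeat('test_speed1(data)', globals=globals(), repeat=5, number=10)) / 10
t2 = min(timeit.repeat('test_speed2(data)', globals=globals(), repeat=5, number=10)) / 10
print(f'Speed1: {1000*t1:0.1f} ms, Speed2: {1000*t2:0.1f} ms')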

"init() uses Python’s highly optimized sorted() function while add() cannot.". This is why the speed2 is faster than the speed3.
This is the answer I got from the developers on the sortedcontainers library.
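
In other words, SortedList(data) can sort the whole input in one pass with sorted(), while repeated add() calls pay a binary-search insertion per element. A small sketch of the three construction patterns (update() is SortedList's documented bulk-add method; relative timings will vary by machine):

from sortedcontainers import SortedList
import random

data = [random.random() for _ in range(10**5)]

sl1 = SortedList(data)   # fastest: one bulk sort in __init__

sl2 = SortedList()
sl2.update(data)         # bulk path too: sorts the batch rather than inserting one by one

sl3 = SortedList()
for val in data:
    sl3.add(val)         # slowest: one binary-search insertion per element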

Related

How to use python timeit to get median runtime

I want to benchmark a bit of Python code (not the language I am used to, but I have to do some comparisons in Python). I have understood that timeit is a good tool for this, and I have code like this:
n = 10
duration = timeit.Timer(my_func).timeit(number=n)
duration/n
to measure the mean runtime of the function. Now, I want to instead have the median time (the reason is that I want to make a comparison to something I get in median time, and it would be good to use the same measure in all cases). Now, timeit only seems to return the full runtime, and not the time of each individual run, so I am not sure how to find the median runtime. What is the best way to get this?
You can use the repeat method instead, which gives you the individual times as a list:
import timeit
from statistics import median

def my_func():
    for _ in range(1000000):
        pass

n = 10
durations = timeit.Timer(my_func).repeat(repeat=n, number=1)
print(median(durations))

Optimizing a multithreaded numpy array function

Given 2 large arrays of 3D points (I'll call the first "source" and the second "destination"), I needed a function that returns, for each element of "source", the index of its closest match in "destination", with this limitation: I can only use numpy... so no scipy, pandas, numexpr, cython...
To do this I wrote a function based on the "brute force" answer to this question. I iterate over the elements of source, find the closest element in destination, and return its index. Due to performance concerns, and again because I can only use numpy, I tried multithreading to speed it up. Here are both the threaded and unthreaded functions and how they compare in speed on an 8-core machine.
import timeit
import numpy as np
from numpy.core.umath_tests import inner1d
from multiprocessing.pool import ThreadPool

def threaded(sources, destinations):
    # Define worker function
    def worker(point):
        dlt = (destinations-point) # delta between destinations and given point
        d = inner1d(dlt,dlt)       # get squared distances
        return np.argmin(d)        # return closest index
    # Multithread!
    p = ThreadPool()
    return p.map(worker, sources)

def unthreaded(sources, destinations):
    results = []
    for i in range(len(sources)):
        dlt = (destinations-sources[i]) # delta between destinations and given point
        d = inner1d(dlt,dlt)            # get squared distances
        results.append(np.argmin(d))    # append closest index
    return results

# Setup the data
n_destinations = 10000 # 10k random destinations
n_sources = 10000      # 10k random sources
destinations = np.random.rand(n_destinations,3) * 100
sources = np.random.rand(n_sources,3) * 100

# Compare!
print 'threaded: %s'%timeit.Timer(lambda: threaded(sources,destinations)).repeat(1,1)[0]
print 'unthreaded: %s'%timeit.Timer(lambda: unthreaded(sources,destinations)).repeat(1,1)[0]
Results:
threaded: 0.894030461056
unthreaded: 1.97295164054
Multithreading seems beneficial, but I was hoping for more than a 2x speedup, given that the real-life datasets I deal with are much larger.
Any recommendations to improve performance (within the limitations described above) would be greatly appreciated!
OK, I've been reading the Maya documentation on Python, and I came to these conclusions/guesses:
They're probably using CPython inside (several references to that documentation and not any other).
They're not fond of threads (lots of non-thread-safe methods).
Given the above, I'd say it's better to avoid threads. Because of the GIL, this is a common problem, and there are several ways around it:
Try to build a C/C++ extension. Once that is done, use threads in C/C++. Personally, I'd try to get SIP working, and then move on.
Use multiprocessing. Even if your custom Python distribution doesn't include it, you can get a working version since it's all pure Python code. multiprocessing is not affected by the GIL since it spawns separate processes; see the sketch below.
One of the above should work for you. If not, try another parallelization tool (after some serious praying).
On a side note, if you're using outside modules, be mindful of matching Maya's Python version. This may be the reason you couldn't build scipy. Of course, scipy has a huge codebase, and the Windows platform is not the most resilient for building things.
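
Since multiprocessing was the recommendation, here is a minimal Python 3 sketch of the unthreaded function split across processes. The names and the chunking strategy are mine, and I use a plain sum of squares because numpy.core.umath_tests.inner1d was removed from newer numpy:

import numpy as np
from multiprocessing import Pool

def closest_indices(args):
    # Worker: for each source point in a chunk, return the index of the
    # nearest destination point (by squared Euclidean distance).
    chunk, destinations = args
    results = []
    for point in chunk:
        dlt = destinations - point     # deltas to every destination
        d = (dlt * dlt).sum(axis=1)    # squared distances
        results.append(int(np.argmin(d)))
    return results

if __name__ == '__main__':
    destinations = np.random.rand(10000, 3) * 100
    sources = np.random.rand(10000, 3) * 100
    n_workers = 8
    chunks = np.array_split(sources, n_workers)
    with Pool(n_workers) as p:
        parts = p.map(closest_indices, [(c, destinations) for c in chunks])
    indices = [i for part in parts for i in part]   # flatten per-chunk results
    print(len(indices))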

numpy OpenBLAS set maximum number of threads

I am using numpy, and my model involves intensive matrix-matrix multiplication.
To speed it up, I use the multi-threaded OpenBLAS library to parallelize numpy.dot.
My setup is as follows:
OS: CentOS 6.2 server, #CPUs = 12, #MEM = 96 GB
Python version: Python 2.7.6
numpy: numpy 1.8.0
OpenBLAS + IntelMKL
$ OMP_NUM_THREADS=8 python test_mul.py
The code, which I took from https://gist.github.com/osdf/, is:
test_mul.py:
import numpy
import sys
import timeit

try:
    import numpy.core._dotblas
    print 'FAST BLAS'
except ImportError:
    print 'slow blas'

print "version:", numpy.__version__
print "maxint:", sys.maxint
print

x = numpy.random.random((1000,1000))
setup = "import numpy; x = numpy.random.random((1000,1000))"
count = 5

t = timeit.Timer("numpy.dot(x, x.T)", setup=setup)
print "dot:", t.timeit(count)/count, "sec"
When I use OMP_NUM_THREADS=1 python test_mul.py, the result is
dot: 0.200172233582 sec
OMP_NUM_THREADS=2
dot: 0.103047609329 sec
OMP_NUM_THREADS=4
dot: 0.0533880233765 sec
So far, so good.
However, when I set OMP_NUM_THREADS=8, the code only works occasionally: sometimes it runs, sometimes it does not even start and gives me core dumps.
When OMP_NUM_THREADS > 10, the code seems to break every time.
I am wondering what is happening here. Is there some maximum number of threads that each process can use, and can I raise that limit, given that I have 12 CPUs in my machine?
Thanks
Firstly, I don't really understand what you mean by 'OpenBLAS + IntelMKL'. Both of those are BLAS libraries, and numpy should only link to one of them at runtime. You should probably check which of these two numpy is actually using. You can do this by calling:
$ ldd <path-to-site-packages>/numpy/core/_dotblas.so
Update: numpy/core/_dotblas.so was removed in numpy v1.10, but you can check the linkage of numpy/core/multiarray.so instead.
For example, I link against OpenBLAS:
...
libopenblas.so.0 => /opt/OpenBLAS/lib/libopenblas.so.0 (0x00007f788c934000)
...
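On recent numpy versions you can also inspect the build configuration from within Python itself, which avoids hunting for the right shared object:

import numpy as np
np.show_config()   # prints the BLAS/LAPACK libraries numpy was built against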
If you are indeed linking against OpenBLAS, did you build it from source? If you did, you should see that in the Makefile.rule there is a commented option:
...
# You can define maximum number of threads. Basically it should be
# less than actual number of cores. If you don't specify one, it's
# automatically detected by the script.
# NUM_THREADS = 24
...
By default OpenBLAS will try to set the maximum number of threads to use automatically, but you could try uncommenting and editing this line yourself if it is not detecting this correctly.
Also, bear in mind that you will probably see diminishing returns in terms of performance from using more threads. Unless your arrays are very large it is unlikely that using more than 6 threads will give much of a performance boost because of the increased overhead involved in thread creation and management.
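
Note also that you don't have to rebuild just to cap the thread count: depending on how it was built, OpenBLAS honours the OPENBLAS_NUM_THREADS and/or OMP_NUM_THREADS environment variables at runtime, e.g.:

$ OPENBLAS_NUM_THREADS=6 python test_mul.py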

multiple cpu usage when accessing data attached to traited classes

I have an application that uses a number of classes inheriting from HasTraits. Some of these classes manage access to data and others provide functions for analyzing that data. This works wonderfully for a gui -- I can check that the data and analysis code is doing what it should. However, I've noticed that when I use these classes for gui-less computations, all the cpus on the system end up getting used.
Here is a small example that shows the cpu usage:
from traits.api import HasTraits, List, Int, Enum, Instance
import numpy as np
import psutil
from itertools import combinations

"""
Small example of high CPU usage by traited classes
"""

class DataStorage(HasTraits):
    nsamples = Int(2000)
    samples = List

    def _samples_default(self):
        return np.random.randn(self.nsamples, 2000).tolist()

    def sample_samples(self, indices):
        """ return a 2D array of data at indices """
        return np.array([self.samples[i] for i in indices])

class DataAccessor(HasTraits):
    """ Class that grabs data and computes something """
    measure = Enum("correlation", "covariance")
    data_source = Instance(DataStorage, ())

    def compute_measure(self, indices):
        """ example of some computation """
        samples = self.data_source.sample_samples(indices)
        percentage = psutil.cpu_percent(interval=0, percpu=True)
        if self.measure == "correlation":
            result = np.corrcoef(samples)
        elif self.measure == "covariance":
            result = np.cov(samples)
        return percentage

# Run a simulation to see cpu usage
analyzer = DataAccessor()
usage = []
n_iterations = 0
max_iterations = 500
for combo in combinations(np.arange(2000), 500):
    # evaluate the measurement on a subset of the data
    usage.append(analyzer.compute_measure(combo))
    n_iterations += 1
    if n_iterations > max_iterations:
        break

print n_iterations
use_percents = np.array(usage).T
When I run this on an 8-cpu machine running CentOS, top reports the python process at roughly 600%.
>>> use_percents.mean(1)
shows
array([ 67.05548902, 67.06906188, 66.89041916, 67.28942116,
66.69421158, 67.61437126, 99.8007984 , 67.31996008])
Question:
My computation is embarrassingly parallel, so it would be great to have the other cpus available to split up the job. Does anyone know what's happening here? A plain python version of this uses 100% on a single cpu.
Is there a way to keep everything local to a single cpu without rewriting all my classes without traits?
Traits is not causing the CPU usage. It's easy to rewrite this bit of code without Traits, and you will see that you get the same pattern of CPU usage (at least, I do).
Instead, what you are probably seeing is the CPU usage of the BLAS library that your build of numpy is linked against. numpy.corrcoef() calls numpy.cov(), and much of the computation in numpy.cov() is taken up by a numpy.dot() call, which does a matrix-matrix multiplication using BLAS. If it is an optimized BLAS library, it will usually use non-Python threads internally to split these computations among your CPUs. You will have to consult the documentation of your optimized BLAS library to find out how to change this.
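
Most optimized BLAS builds let you cap their thread count through environment variables set before numpy (and hence the BLAS) is loaded. A minimal sketch; which variable applies depends on your BLAS (OPENBLAS_NUM_THREADS for OpenBLAS, MKL_NUM_THREADS for MKL, OMP_NUM_THREADS as a common fallback):

import os
# Must be set before numpy is imported, because the BLAS reads them at load time.
os.environ['OMP_NUM_THREADS'] = '1'
os.environ['OPENBLAS_NUM_THREADS'] = '1'
os.environ['MKL_NUM_THREADS'] = '1'

import numpy as np
x = np.random.randn(500, 2000)
np.corrcoef(x)   # the matrix multiply inside should now stay on one core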

Is it REALLY true that Python code runs faster in a function?

I saw a comment that led me to the question Why does Python code run faster in a function?.
I got to thinking, and figured I would try it myself using the timeit library; however, I got very different results:
(note: 10**8 was changed to 10**7 to make things a little bit speedier to time)
>>> from timeit import repeat
>>> setup = """
def main():
for i in xrange(10**7):
pass
"""
>>> stmt = """
for i in xrange(10**7):
pass
"""
>>> min(repeat('main()', setup, repeat=7, number=10))
1.4399558753975725
>>> min(repeat(stmt, repeat=7, number=10))
1.4410973942722194
>>> 1.4410973942722194 / 1.4399558753975725
1.000792745732109
Did I use timeit correctly?
Why are these results less than 0.1% different from each other, while the results from the other question were nearly 250% different?
Does it only make a difference when using CPython compiled versions of Python (like Cython)?
Ultimately: is Python code really faster in a function, or does it just depend on how you time it?
The flaw in your test is the way timeit compiles the code of your stmt. It's actually compiled within the following template:
template = """
def inner(_it, _timer):
%(setup)s
_t0 = _timer()
for _i in _it:
%(stmt)s
_t1 = _timer()
return _t1 - _t0
"""
Thus stmt is actually running in a function, using the fastlocals array (i.e. STORE_FAST).
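You can see this in the bytecode with the dis module (my own quick illustration, not from the original answer): the loop variable inside a function is stored with STORE_FAST, while the same loop compiled at module scope uses the slower STORE_NAME:

import dis

def in_function():
    for i in range(10):
        pass

dis.dis(in_function)   # loop variable i: STORE_FAST
dis.dis(compile("for i in range(10): pass", "<string>", "exec"))   # i: STORE_NAME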
Here's a test with the function from your question as f_opt versus the module-compiled stmt wrapped in the function f_no_opt (which, despite being a function, still uses the slower STORE_NAME bytecode because it was compiled at module scope):
>>> import types
>>> code = compile(stmt, '<string>', 'exec')
>>> f_no_opt = types.FunctionType(code, globals())
>>> t_no_opt = min(timeit.repeat(f_no_opt, repeat=10, number=10))
>>> t_opt = min(timeit.repeat(f_opt, repeat=10, number=10))
>>> t_opt / t_no_opt
0.4931101445632647
It comes down to compiler optimization. When performing just-in-time compilation, it is much easier to identify frequently used chunks of code if they're found in functions.
The efficiency gains really depend on the nature of the tasks being performed. In your example, you aren't doing anything computationally intensive, which leaves fewer opportunities for gains through optimization.
As others have pointed out, however, CPython does not do just-in-time compilation. But when code is compiled ahead of time, C compilers can often make functions execute faster, for example by inlining them.
Check out this document on the GCC compiler: http://gcc.gnu.org/onlinedocs/gcc/Inline.html
