Efficient computation of an entropy-like formula (sum(x*log(x))) in Python

I'm looking for an efficient way to compute the entropy of vectors, without normalizing them and while ignoring any non-positive value.
Since the vectors aren't probability vectors, and shouldn't be normalized, I can't use scipy's entropy function.
So far I couldn't find a single numpy or scipy function that does this, and as a result my alternatives break the computation into 2 steps, which involve intermediate arrays and slow down the run time. If anyone can think of a single function for this computation, that would be interesting.
Below is a timeit script for measuring several alternatives that I tried. I'm using a pre-allocated array to avoid repeated allocations and deallocations during run time. You can select which alternative to run by setting the value of func_code. I included the nansum approach offered by one of the answers. The measurements on my MacBook Pro 2019 are:
matmul: 16.720187613
xlogy: 17.296380516
nansum: 20.059866123000003
import timeit
import numpy as np
from scipy import special

def matmul(arg):
    a, log_a = arg
    log_a.fill(0)
    np.log2(a, where=a > 0, out=log_a)
    return (a[:, None, :] @ log_a[..., None]).ravel()

def xlogy(arg):
    a, log_a = arg
    a[a < 0] = 0
    return np.sum(special.xlogy(a, a), axis=1) * (1/np.log(2))

def nansum(arg):
    a, log_a = arg
    return np.nansum(a * np.log2(a, out=log_a), axis=1)

def setup():
    a = np.random.rand(20, 1000) - 0.1
    log = np.empty_like(a)
    return a, log

setup_code = """
from __main__ import matmul, xlogy, nansum, setup
data = setup()
"""

func_code = "matmul(data)"
print(timeit.timeit(func_code, setup=setup_code, number=100000))

On my machine the computation of the logarithms takes about 80% of the time of matmul, so it is definitely the bottleneck, and optimizing the other functions will result in a negligible speed up.
The bad news is that the default implementation of np.log is not yet optimized on most platforms. Indeed, it is not vectorized by default, except on recent x86 Intel processors supporting AVX-512 (i.e. basically Skylake processors on servers and Ice Lake processors on PCs, though not the recent Alder Lake). This means the computation could be significantly faster once vectorized. AFAIK, the closed-source SVML library does support AVX/AVX2 and could speed it up (on x86-64 processors only). SVML is supported by Numexpr and Numba, which can be faster because of that, assuming you have access to the non-free SVML, which is part of the Intel tools often available on HPC machines (e.g. MKL, oneAPI, etc.).
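For reference, a Numexpr variant of the reduction might look like the sketch below; whether it actually helps depends on your Numexpr build (SVML or not) and on the thread count:
import numpy as np
import numexpr as ne

a = np.random.rand(20, 1000) - 0.1
# One fused, multi-threaded pass; non-positive entries contribute 0 to the sum.
ent = ne.evaluate("sum(where(a > 0, a * log(a), 0.0), axis=1)") / np.log(2)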
If you do not have access to the SVML there are two possible remaining options:
Implement your own optimized SIMD log2 function, which is possible but hard since it requires a good understanding of the hardware SIMD units and certainly requires writing C or Cython code. This solution consists of computing the log2 function as an n-degree polynomial approximation (it can be exact to 1 ULP with a big n, though one generally does not need that). Naive approximations (e.g. n=1) are much simpler to implement but often too inaccurate for scientific use (a sketch of the polynomial idea follows below).
Implement a multi-threaded log computation typically using Numba/Cython. This is a desperate solution as multithreading can slow things down if the input data is not large enough.
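To give a flavour of the polynomial idea from the first option (a plain-NumPy sketch, not the SIMD C code it would really take, and using an illustrative least-squares fit rather than proper minimax coefficients):
import numpy as np

# Split x into mantissa and exponent, then approximate log2 of the mantissa
# with a low-degree polynomial. The fit below is for illustration only; a
# production version would hard-code higher-degree minimax coefficients.
_m = np.linspace(0.5, 1.0, 1024)
_coeffs = np.polyfit(_m, np.log2(_m), 5)

def approx_log2(x):
    mantissa, exponent = np.frexp(x)     # x = mantissa * 2**exponent, mantissa in [0.5, 1)
    return np.polyval(_coeffs, mantissa) + exponent

x = np.random.rand(1000) + 1e-3
print(np.max(np.abs(approx_log2(x) - np.log2(x))))   # worst-case approximation error

This is the math a SIMD implementation would inline with hand-written intrinsics; the NumPy version above will not itself be faster than np.log2.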
Here is an example of multi-threaded Numba solution:
import numpy as np
import numba as nb

@nb.njit('(UniTuple(f8[:,::1],2),)', parallel=True)
def matmul(arg):
    a, log_a = arg
    result = np.empty(a.shape[0])
    for i in nb.prange(a.shape[0]):
        s = 0.0
        for j in range(a.shape[1]):
            if a[i, j] > 0:
                s += a[i, j] * np.log2(a[i, j])
        result[i] = s
    return result
This is about 4.3 times faster on my 6-core PC (200 µs vs 46.4 µs). However, you should be careful if you run this on a server with many cores on such a small dataset, as it can actually be slower on some platforms.

Taking np.log2 of negative numbers (or zero) just gives a runtime warning, and the corresponding terms of v_i*np.log2(v_i) come out as np.nan, which is probably the best way to deal with them. If you don't want them to pollute your sum, just use
np.nansum(v_i*np.log2(v_i))
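For example, a small self-contained version of the same idea, with the warnings silenced via np.errstate:
import numpy as np

v = np.random.rand(20, 1000) - 0.1
with np.errstate(divide="ignore", invalid="ignore"):
    # non-positive entries turn into NaN in the product and are dropped by nansum
    ent = np.nansum(v * np.log2(v), axis=1)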

Related

Vectorization of recursively defined problem - Python

I was wondering if anyone would have an idea on how I am able to vectorize the following loop:
for i in range(1, (T*n)+1):
    Y = Y + np.diag(mu) @ Y * dt + np.multiply(np.diag(sigma) @ Y, L @ np.random.normal(0, dt, (d, N)))
where the following parameters are already d×N matrices (I already vectorized a loop with that):
Y (this is the recursive parameter)
np.diag(mu) @ Y * dt
np.diag(sigma) @ Y
L @ np.random.normal(0, dt, (d, N))
Any help would be very appreciated. :)
With best regards!
Unfortunately, this doesn't look like vectorizable code:
Iterations should be independent. Typically, vectorization means making several iterations at once. Typically, it also implies using AVX, SSE or FMA instructions (if we talk about x86 processors) to make iterations go truly in parallel on a hardware level.
Continuing about vector assembly instructions, such level of optimization is typically unreachable from python code because the interpreter isn't that smart. An iteration is also doing too much to be vectorized. It actually contains sub-loops! We don't see it but matrix multiplications do involve more loops.
So I wouldn't call optimizing this loop a "vectorization". But luckily, there are still things to check:
Profile it. Find out what part of the computation consumes most of the time.
Verify that np.random doesn't slow down the program significantly. If yes, you can rely on pre-generated values instead (a sketch follows at the end of this answer).
Check if code that can be vectorized is vectorized. That means, verify that your numpy is built with SSE/AVX support and that matrix multiplications use that under the hood. It can be a bit tricky to do but up to x4 speedups* are possible with AVX usage.
If parts of the code are indeed vectorized on the assembly level, switching to storing data in float16 arrays can make it even faster. To my knowledge, AVX does support operations on large blocks of 16-bit floats.
Rewrite it in C/Cython or try out Numba JIT compilation for the same task.
If compilation even with Numba is not the case, I wonder if Tensorflow can help here. With Tensorflow, Python code doesn't kick off computations immediately but rather constructs a computational graph that is then executed without returning to the interpreter level. Tensorflow does support AVX and SSE (although not without pain), so you may expect more control over low-level details than with numpy. And you can also try to launch it on GPU.
Last thing, I don't quite believe in it, but does loop unrolling help?
for i in range(1, (T * n + 1) // 4):
    Y = Y + ...
    Y = Y + ...
    Y = Y + ...
    Y = Y + ...
* - subject to Amdahl's law
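As a concrete sketch of the pre-generation suggestion above, with made-up sizes, and with Y, mu, sigma, L, dt, T, n meant to match the quantities from the question:
import numpy as np

d, N, T, n = 4, 1000, 10, 250
dt = 1.0 / n
rng = np.random.default_rng(0)
Y = rng.standard_normal((d, N))
mu = rng.standard_normal(d)
sigma = rng.standard_normal(d)
L = np.eye(d)

steps = T * n
# All random draws generated up front (costs steps*d*N floats of memory).
dW = rng.normal(0.0, dt, size=(steps, d, N))

for i in range(steps):
    # mu[:, None] * Y equals np.diag(mu) @ Y but avoids a d x d matmul per step.
    Y = Y + mu[:, None] * Y * dt + (sigma[:, None] * Y) * (L @ dW[i])

Wrapping this loop in a function and compiling it with Numba's @njit, as suggested above, is then a natural next step.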

Performance degradation of matrix multiplication of single vs double precision arrays on multi-core machine

UPDATE
Unfortunately, due to my oversight, I had an older version of MKL (11.1) linked against numpy. A newer version of MKL (11.3.1) gives the same performance in C and when called from Python.
What was obscuring things was that even when the compiled shared libraries were explicitly linked against the newer MKL, and the LD_* variables pointed to them, importing numpy in Python still somehow loaded the old MKL libraries. Only by replacing all libmkl_*.so files in the Python lib folder with the newer MKL was I able to match the performance of the Python and C calls.
Background / library info.
Matrix multiplication was done via sgemm (single-precision) and dgemm (double-precision) Intel's MKL library calls, via numpy.dot function. The actual call of the library functions can be verified with e.g. oprof.
I'm using a 2x18-core CPU (E5-2699 v3) here, hence a total of 36 physical cores.
KMP_AFFINITY=scatter. Running on linux.
TL;DR
1) Why is numpy.dot, even though it is calling the same MKL library functions, at best twice as slow compared to C compiled code?
2) Why does performance via numpy.dot decrease with an increasing number of cores, whereas the same effect is not observed in C code (calling the same library functions)?
The problem
I've observed that doing matrix multiplication of single/double precision floats in numpy.dot, as well as calling cblas_sgemm/dgemm directly from a compiled C shared library, gives noticeably worse performance compared to calling the same MKL cblas_sgemm/dgemm functions from inside pure C code.
import numpy as np
import mkl

n = 10000
A = np.random.randn(n, n).astype('float32')
B = np.random.randn(n, n).astype('float32')
C = np.zeros((n, n)).astype('float32')

mkl.set_num_threads(3);  %time np.dot(A, B, out=C)   # 11.5 seconds
mkl.set_num_threads(6);  %time np.dot(A, B, out=C)   # 6 seconds
mkl.set_num_threads(12); %time np.dot(A, B, out=C)   # 3 seconds
mkl.set_num_threads(18); %time np.dot(A, B, out=C)   # 2.4 seconds
mkl.set_num_threads(24); %time np.dot(A, B, out=C)   # 3.6 seconds
mkl.set_num_threads(30); %time np.dot(A, B, out=C)   # 5 seconds
mkl.set_num_threads(36); %time np.dot(A, B, out=C)   # 5.5 seconds
Doing exactly the same as above, but with double precision A, B and C, you get:
3 cores: 20s, 6 cores: 10s, 12 cores: 5s, 18 cores: 4.3s, 24 cores: 3s, 30 cores: 2.8s, 36 cores: 2.8s.
The leveling-off of the speed-up for single-precision floats seems to be associated with cache misses.
For 28 core run, here is the output of perf.
For single precision:
perf stat -e task-clock,cycles,instructions,cache-references,cache-misses ./ptestf.py
631,301,854 cache-misses # 31.478 % of all cache refs
And double precision:
93,087,703 cache-misses # 5.164 % of all cache refs
C shared library, compiled with
/opt/intel/bin/icc -o comp_sgemm_mkl.so -openmp -mkl sgem_lib.c -lm -lirc -O3 -fPIC -shared -std=c99 -vec-report1 -xhost -I/opt/intel/composer/mkl/include
#include <stdio.h>
#include <stdlib.h>
#include "mkl.h"

void comp_sgemm_mkl(int m, int n, int k, float *A, float *B, float *C);

void comp_sgemm_mkl(int m, int n, int k, float *A, float *B, float *C)
{
    int i, j;
    float alpha, beta;
    alpha = 1.0; beta = 0.0;
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, alpha, A, k, B, n, beta, C, n);
}
Python wrapper function, calling the above compiled library:
from ctypes import CDLL, c_int, c_void_p
import numpy as np

def comp_sgemm_mkl(A, B, out=None):
    lib = CDLL(omplib)   # omplib: path to the compiled shared library
    lib.comp_sgemm_mkl.argtypes = [c_int, c_int, c_int,
                                   np.ctypeslib.ndpointer(dtype=np.float32, ndim=2),
                                   np.ctypeslib.ndpointer(dtype=np.float32, ndim=2),
                                   np.ctypeslib.ndpointer(dtype=np.float32, ndim=2)]
    lib.comp_sgemm_mkl.restype = c_void_p
    m = A.shape[0]
    n = B.shape[0]
    k = B.shape[1]
    if np.isfortran(A):
        raise ValueError('Fortran array')
    if m != n:
        raise ValueError('Wrong matrix dimensions')
    if out is None:
        out = np.empty((m, k), np.float32)
    lib.comp_sgemm_mkl(m, n, k, A, B, out)
However, explicit calls from a C-compiled binary calling MKL's cblas_sgemm / cblas_dgemm, with arrays allocated through malloc in C, give almost 2x better performance compared to the python code, i.e. the numpy.dot call. Also, the effect of performance degradation with an increasing number of cores is NOT observed. The best performance was 900 ms for single-precision matrix multiplication, achieved when using all 36 physical cores via mkl_set_num_cores and running the C code with numactl --interleave=all.
Are there any fancy tools or advice for profiling/inspecting/understanding this situation further? Any reading material is much appreciated as well.
UPDATE
Following @Hristo Iliev's advice, running numactl --interleave=all ./ipython did not change the timings (within noise), but it improves the pure C binary runtimes.
I suspect this is due to unfortunate thread scheduling. I was able to reproduce an effect similar to yours. Python was running at ~2.2 s, while the C version was showing huge variations from 1.4-2.2 s.
Applying:
KMP_AFFINITY=scatter,granularity=thread
ensures that the 28 threads always run on the same processor thread, and reduces both runtimes to a more stable ~1.24 s for C and ~1.26 s for Python.
This is on a 28 core dual socket Xeon E5-2680 v3 system.
Interestingly, on a very similar 24 core dual socket Haswell system, both python and C perform almost identical even without thread affinity / pinning.
Why does Python affect the scheduling? Well, I assume there is more runtime environment around it. The bottom line is, without pinning your performance results will be non-deterministic.
Also you need to consider, that the Intel OpenMP runtime spawns an extra management thread that can confuse the scheduler. There are more choices for pinning, for instance KMP_AFFINITY=compact - but for some reason that is totally messed up on my system. You can add ,verbose to the variable to see how the runtime is pinning your threads.
likwid-pin is a useful alternative providing more convenient control.
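For a quick experiment from Python itself, the pinning environment can also be set before numpy (and hence MKL and the Intel OpenMP runtime) is imported; exporting the variables in the shell before launching Python is the more reliable route, and the thread count below is only an example:
import os

os.environ["KMP_AFFINITY"] = "scatter,granularity=thread,verbose"
os.environ["MKL_NUM_THREADS"] = "28"

import numpy as np
# ... np.dot(A, B, out=C) benchmark as in the question ...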
In general single precision should be at least as fast as double precision. Double precision can be slower because:
You need more memory/cache bandwidth for double precision.
You can build ALUs that have higher throughput for single precision, but that usually applies to GPUs rather than CPUs.
I would think that once you get rid of the performance anomaly, this will be reflected in your numbers.
When you scale up the number of threads for MKL/*gemm, consider
Memory /shared cache bandwidth may become a bottleneck, limiting the scalability
Turbo mode will effectively decrease the core frequency as utilization increases. This applies even when you run at nominal frequency: on Haswell-EP processors, AVX instructions impose a lower "AVX base frequency" - but the processor is allowed to exceed it when fewer cores are utilized / thermal headroom is available, and in general even more so for short periods. If you want perfectly neutral results, you would have to use the AVX base frequency, which is 1.9 GHz for you. It is documented here, and explained in one picture.
I don't think there is a really simple way to measure how your application is affected by bad scheduling. You can expose this with perf trace -e sched:sched_switch and there is some software to visualize this, but this will come with a high learning curve. And then again - for parallel performance analysis you should have the threads pinned anyway.

pyOpenCL getting different results compared to numpy

I'm trying to get started with pyOpenCL and GPGPU in general.
For the below dot product code I'm getting fairly different results between the GPU and CPU versions. What am I doing wrong?
The difference of ~0.5% seems too large for floating point errors to account for. The difference does seem to increase with array size (~9e-8 relative difference with an array size of 10000). Maybe it's an issue with combining results across blocks...? Either way, color me disconcerted.
I don't know if it matters: I'm running this on a MacBook Air, Intel(R) Core(TM) i7-4650U CPU @ 1.70GHz, with Intel HD Graphics 5000.
Thanks in advance.
import pyopencl as cl
import numpy
from pyopencl.reduction import ReductionKernel
import pyopencl.clrandom as cl_rand

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

dot = ReductionKernel(ctx,
                      dtype_out=numpy.float32,
                      neutral="0",
                      map_expr="x[i]*y[i]",
                      reduce_expr="a+b",
                      arguments="__global const float *x, __global const float *y")

x = cl_rand.rand(queue, 100000000, dtype=numpy.float32)
y = cl_rand.rand(queue, 100000000, dtype=numpy.float32)

x_dot_y = dot(x, y).get()                  # GPU: array(25001304.0, dtype=float32)
x_dot_y_cpu = numpy.dot(x.get(), y.get())  # CPU: 24875690.0
print abs(x_dot_y_cpu - x_dot_y)/x_dot_y   # 0.0050496689740063489
The order in which values are reduced will likely be very different between these two methods. Across large data sets, the tiny errors in floating point rounding can soon add up. There could also be other details about the underlying implementations that affect the precision of the result.
I've run your example code on my own machine and get a similar sort of difference in the final result (~0.5%). As a data point, you can implement a very simple dot product in raw Python and see how much that differs from both the OpenCL result and from Numpy.
For example, you could add something simple like this to your example:
x_dot_y_naive = sum(a*b for a,b in zip(x.get(), y.get()))
Here's the results I get on my machine:
OPENCL: 25003466.000000
NUMPY: 24878146.000000 (0.5012%)
NAIVE: 25003465.601387 (0.0000%)
As you can see, the naive implementation is closer to the OpenCL version than Numpy is. One explanation for this could be that Numpy's dot function probably makes use of fused multiply-add (FMA) operations, which will change how intermediate results are rounded. Without any compiler options to tell it otherwise, OpenCL should be fully complying with the IEEE 754 standard, rather than using the faster FMA operations.
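As a quick sanity check (not part of the original comparison), you can also compute a reference dot product with the accumulation done in float64, so that the rounding error of the reference itself is negligible next to the float32 differences above:
import numpy

xf = x.get().astype(numpy.float64)
yf = y.get().astype(numpy.float64)
print("float64 reference: %.6f" % numpy.dot(xf, yf))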

multiple cpu usage when accessing data attached to traited classes

I have an application that uses a number of classes inheriting from HasTraits. Some of these classes manage access to data and others provide functions for analyzing that data. This works wonderfully for a gui -- I can check that the data and analysis code is doing what it should. However, I've noticed that when I use these classes for gui-less computations, all the cpus on the system end up getting used.
Here is a small example that shows the cpu usage:
from traits.api import HasTraits, List, Int, Enum, Instance
import numpy as np
import psutil
from itertools import combinations

"""
Small example of high CPU usage by traited classes
"""

class DataStorage(HasTraits):
    nsamples = Int(2000)
    samples = List

    def _samples_default(self):
        return np.random.randn(self.nsamples, 2000).tolist()

    def sample_samples(self, indices):
        """ return a 2D array of data at indices """
        return np.array(
            [self.samples[i] for i in indices])

class DataAccessor(HasTraits):
    """ Class that grabs data and computes something """
    measure = Enum("correlation", "covariance")
    data_source = Instance(DataStorage, ())

    def compute_measure(self, indices):
        """ example of some computation """
        samples = self.data_source.sample_samples(indices)
        percentage = psutil.cpu_percent(interval=0, percpu=True)
        if self.measure == "correlation":
            result = np.corrcoef(samples)
        elif self.measure == "covariance":
            result = np.cov(samples)
        return percentage

# Run a simulation to see cpu usage
analyzer = DataAccessor()
usage = []
n_iterations = 0
max_iterations = 500
for combo in combinations(np.arange(2000), 500):
    # evaluate the measurement on a subset of the data
    usage.append(analyzer.compute_measure(combo))
    n_iterations += 1
    if n_iterations > max_iterations:
        break

print n_iterations
use_percents = np.array(usage).T
When I run this on an 8-cpu machine running CentOS, top reports the python process at roughly 600%.
>>> use_percents.mean(1)
shows
array([ 67.05548902, 67.06906188, 66.89041916, 67.28942116,
66.69421158, 67.61437126, 99.8007984 , 67.31996008])
Question:
My computation is embarrassingly parallel, so it would be great to have the other cpus available to split up the job. Does anyone know what's happening here? A plain python version of this uses 100% on a single cpu.
Is there a way to keep everything local to a single cpu, without rewriting all my classes to drop traits?
Traits is not causing the CPU usage. It's easy to rewrite this bit of code without Traits, and you will see that you get the same pattern of CPU usage (at least, I do).
Instead, what you are probably seeing is the CPU usage of the BLAS library that your build of numpy is linked against. numpy.corrcoef() calls numpy.cov(), and much of the computation of numpy.cov() is taken up by a numpy.dot() call, which does a matrix-matrix multiplication using BLAS. If it is an optimized BLAS library, then it will usually use non-Python threads internally to split up these computations among your CPUs. You will have to consult the documentation of your optimized BLAS library to find out how to change this.
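A common blunt way to act on that advice is to cap the BLAS thread pool through environment variables before numpy (and therefore the BLAS library) is loaded; which variable is honoured depends on the BLAS your numpy build is linked against, so the sketch below sets the usual candidates:
import os

os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"        # MKL builds
os.environ["OPENBLAS_NUM_THREADS"] = "1"   # OpenBLAS builds

import numpy as np   # import only after the variables are set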

Optimising memory usage in numpy

The following program loads two images with PyGame, converts them to Numpy arrays, and then performs some other Numpy operations (such as FFT) to emit a final result (of a few numbers). The inputs can be large, but at any moment only one or two large objects should be live.
A test image is about 10M pixels, which translates to 10MB once it's greyscaled. It gets converted to a Numpy array of dtype uint8, which after some processing (applying Hamming windows), is an array of dtype float64. Two images are loaded into arrays this way; later FFT steps result in an array of dtype complex128. Prior to adding the excessive gc.collect calls, the program memory size tended to increase with each step. Additionally, it seems most Numpy operations will give a result in the highest precision available.
Running the test (sans the gc.collect calls) on my 1GB Linux machine results in prolonged thrashing, which I have not waited for. I don't yet have detailed memory use stats -- I tried some Python modules and the time command to no avail; now I'm looking into valgrind. Watching PS (and dealing with machine unresponsiveness in the later stages of the test) suggests a maximum memory usage of about 800 MB.
A 10 million cell array of complex128 should occupy 160 MB. Having (ideally) at most two of these live at one time, plus the not-insubstantial Python and Numpy libraries and other paraphernalia, probably means allowing for 500 MB.
I can think of two angles from which to attack the problem:
Discarding intermediate arrays as soon as possible. That's what the gc.collect calls are for -- they seem to have improved the situation, as it now completes with only a few minutes of thrashing ;-). I think one can expect that memory-intensive programming in a language like Python will require some manual intervention.
Using less-precise Numpy arrays at each step. Unfortunately the operations that return arrays, like fft2, do not appear to allow the type to be specified.
So my main question is: is there a way of specifying output precision in Numpy array operations?
More generally, are there other common memory-conserving techniques when using Numpy?
Additionally, does Numpy have a more idiomatic way of freeing array memory? (I imagine this would leave the array object live in Python, but in an unusable state.) Explicit deletion followed by immediate GC feels hacky.
import sys
import numpy
import pygame
import gc

def get_image_data(filename):
    im = pygame.image.load(filename)
    im2 = im.convert(8)
    a = pygame.surfarray.array2d(im2)
    hw1 = numpy.hamming(a.shape[0])
    hw2 = numpy.hamming(a.shape[1])
    a = a.transpose()
    a = a*hw1
    a = a.transpose()
    a = a*hw2
    return a

def check():
    gc.collect()
    print 'check'

def main(args):
    pygame.init()
    pygame.sndarray.use_arraytype('numpy')
    filename1 = args[1]
    filename2 = args[2]
    im1 = get_image_data(filename1)
    im2 = get_image_data(filename2)
    check()
    out1 = numpy.fft.fft2(im1)
    del im1
    check()
    out2 = numpy.fft.fft2(im2)
    del im2
    check()
    out3 = out1.conjugate() * out2
    del out1, out2
    check()
    correl = numpy.fft.ifft2(out3)
    del out3
    check()
    maxs = correl.argmax()
    maxpt = maxs % correl.shape[0], maxs / correl.shape[0]
    print correl[maxpt], maxpt, (correl.shape[0] - maxpt[0], correl.shape[1] - maxpt[1])

if __name__ == '__main__':
    args = sys.argv
    exit(main(args))
This answer on SO says "Scipy 0.8 will have single precision support for almost all the fft code", and SciPy 0.8.0 beta 1 is just out.
(I haven't tried it myself, cowardly.)
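For what it's worth, single-precision FFT support has long since landed in SciPy. A quick way to check, using the modern scipy.fft namespace (i.e. a much newer stack than the Python 2 code above):
import numpy as np
from scipy import fft

a = np.random.rand(512, 512).astype(np.float32)
print(np.fft.fft2(a).dtype)   # complex128 -- numpy.fft has historically upcast to double
print(fft.fft2(a).dtype)      # complex64  -- scipy.fft keeps single precision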
If I understand correctly, you are calculating a convolution between two images. The Scipy package contains a dedicated module for that (scipy.ndimage), which might be more memory efficient than the "manual" approach via Fourier transforms. It would be good to try it instead of going through Numpy.
