Faster rolling of random Gaussian vectors

Faster rolling of random Gaussian vectors - python

For a Monte-Carlo like simulation, I need to pick at random thousands of random Gaussian vectors (that is, vectors having independently normally distributed entries). Each such vector is of fixed length (around 100).
NumPy has a method of achieving this:
import numpy.random
vectors = [numpy.random.normal(size=100) for _ in xrange(10000)]
NumPy's random.normal function is of linear complexity, with overhead for small size values. However, it looks like that overhead is not significant for size=100 (perhaps around 30%, tested empirically; compare with the overhead for size=1, which is about 2300%). Perhaps I can save some of this overhead by rolling once, then splitting the array (haven't tried that yet).
However, it is still much too slow for my needs. Perhaps I'm too greedy here; I know that NumPy's randomization functions are written in c with optimization in mind; still,
timeit numpy.random.normal(size=100)
# 100000 loops, best of 3: 5.8 us per loop
(tested inside IPython, using its magic %timeit)
That makes ~0.06 seconds for 10k vectors. I was wondering whether there's a much faster method which will allow me to roll 10k vectors of size 100 (say) within less than 0.6ms, that is, 100 times faster. A solution may involve c extensions or whatever needed.
Update
A very simple c++ code, based on an example from cppreference, shows a much better performance:
#include <iostream>
#include <random>
int main()
{
float x;
std::random_device rd;
std::mt19937 gen(rd());
std::normal_distribution <> d(0,1);
for(int i=0; i < 100000; i++)
{
x = d(gen);
}
std::cout << x << '\n';
return 0;
}
and time shows:
real 0m0.028s
user 0m0.020s
sys 0m0.004s
which is about X20 times faster than what NumPy gives. However, I am not sure about the overhead of c-extensions for python, and I have no intuition about whether this can become a python function which is faster than numpy.random.normal.

Related

Efficient computation of entropy-like formula (sum(xlogx)) in Python

I'm looking for an efficient way to compute the entropy of vectors, without normalizing them and while ignoring any non-positive value.
Since the vectors aren't probability vectors, and shouldn't be normalized, I can't use scipy's entropy function.
So far I couldn't find a single numpy or scipy function to obtain this, and as a result my alternatives involve breaking the computation into 2 steps, which involve intermediate arrays and slow down the run time. If anyone can think of a single function for this computation it will be interseting.
Below is a timeit script for measuring several alternatives at I tried. I'm using a pre-allocated array to avoid repeated allocations and deallocations during run-time. It's possible to select which alternative to run by setting the value of func_code. I included the nansum offered by one of the answers. The measurements on My MacBook Pro 2019 are:
matmul: 16.720187613
xlogy: 17.296380516
nansum: 20.059866123000003
import timeit
import numpy as np
from scipy import special
def matmul(arg):
a, log_a = arg
log_a.fill(0)
np.log2(a, where=a > 0, out=log_a)
return (a[:, None, :] # log_a[..., None]).ravel()
def xlogy(arg):
a, log_a = arg
a[a < 0] = 0
return np.sum(special.xlogy(a, a), axis=1) * (1/np.log(2))
def nansum(arg):
a, log_a = arg
return np.nansum(a * np.log2(a, out=log_a), axis=1)
def setup():
a = np.random.rand(20, 1000) - 0.1
log = np.empty_like(a)
return a, log
setup_code = """
from __main__ import matmul, xlogy, nansum, setup
data = setup()
"""
func_code = "matmul(data)"
print(timeit.timeit(func_code, setup=setup_code, number=100000))

On my machine the computation of the logarithms takes about 80% of the time of matmul so it is definitively the bottleneck an optimizing other functions will result in a negligible speed up.
The bad news is that the default implementation np.log is not yet optimized on most platforms. Indeed, it is not vectorized by default, except on recent x86 Intel processors supporting AVX-512 (ie. basically Skylake processors on servers and IceLake processors on PCs, not recent AlderLake though). This means the computation could be significantly faster once vectorized. AFAIK, the close-source SVML library do support AVX/AVX2 and could speed up it (on x86-64 processors only). SMVL is supported by Numexpr and Numba which can be faster because of that assuming you have access to the non-free SVML which is a part of Intel tools often available on HPC machines (eg. like MKL, OneAPI, etc.).
If you do not have access to the SVML there are two possible remaining options:
Implement your own optimized SIMD log2 function which is possible but hard since it require a good understanding of the hardware SIMD units and certainly require to write a C or Cython code. This solutions consists in computing the log2 function as a n-degree polynomial approximation (it can be exact to 1 ULP with a big n though one generally do not need that). Naive approximations (eg. n=1) are much simple to implement but often too inaccurate for a scientific use).
Implement a multi-threaded log computation typically using Numba/Cython. This is a desperate solution as multithreading can slow things down if the input data is not large enough.
Here is an example of multi-threaded Numba solution:
import numba as nb
#nb.njit('(UniTuple(f8[:,::1],2),)', parallel=True)
def matmul(arg):
a, log_a = arg
result = np.empty(a.shape[0])
for i in nb.prange(a.shape[0]):
s = 0.0
for j in range(a.shape[1]):
if a[i, j] > 0:
s += a[i, j] * np.log2(a[i, j])
result[i] = s
return result
This is about 4.3 times faster on my 6-core PC (200 us VS 46.4 us). However, you should be careful if you run this on a server with many cores on such small dataset as it can actually be slower on some platforms.

Having np.log2 of negative numbers (or zero) just gives a runtime warning and sets those values to np.nan, which is probably the best way to deal with them. If you don't want them to pollute your sum, just use
np.nansum(v_i*np.log2(v_i))

Decomposition of matrices for CPLEX and machine learning application

I am dealing with big matrices and time to time my code ends with 'killed:9' message in my terminal. I'm working on Mac OSx.
A wise programmer tells me the problem in my code is liked to the stored matrix I am dealing with.
nn = 35000
dd = 35
XX = np.random.rand(nn,dd)
XX = XX.dot(XX.T) #it should be faster than np.dot(XX,XX.T)
yy = np.random.rand(nn,1)
XX = np.multiply(XX,yy.T)
I have to store this huge matrix XX, my guess: I split the matrix with
upp = np.triu(XX)
Do I actually save space in terms of stored data?
What if later on I store
low = app.T
am I wasting memory and computational time?

It should take up the same total amount of memory. To avoid the error you are probably looking at a few options:
Process batch wise
If you create your model over the CPLEX API, once you supplied the data it is handled by CPLEX I believe. So you could split the data and load it piece by piece and add it to the model consecutively.
Allocate memory manually
If you use Cython you can use the function malloc to allocate memory manually for your array, the size will very likely be no issue then.
Option 1 would be the preferred option in my opinion.
EDIT:
I constructed a little example. It actually combines the two options. The array is not stored as a Python object, but as a C array and the values are computed piecewise.
I am allocating the memory for the array using Cython and malloc. To run the code you have to install Cython.Then you can open a python interpreter at the directory you saved the file and write:
import pyximport;pyximport.install()
import nameofscript
An example for processing your array:
import numpy as np
from libc.stdlib cimport malloc # Allocate memory manually
from cython.parallel import prange # Parallel processing without GIL
dd = 35
# With cdef we can define C variables in Cython.
cdef double **XXN
cdef double y[35000]
cdef int i, j, nn
nn = 35000
# Allocate memory for the Matrix with 1.225 billion double elements
XXN = <double **>malloc(nn * sizeof(double *))
for i in range(nn):
XXN[i] = <double *>malloc(nn * sizeof(double))
XX = np.random.rand(nn,dd)
for i in range(nn):
for j in range(nn):
# Compute the values for the new matrix element by element
XXN[i][j] = XX[i].dot(XX[j].T)
# Multiply the new matrix with y column wise
for i in prange(nn, nogil=True, num_threads=4):
for j in range(nn):
XXN[i][j] = XXN[i][j] * y[i]
Save this file as nameofscript.pyx and run it as described above. I have briefly tested this script and it runs about half an hour on my machine. You can extend this script and use the result array XXN for your further computations.
A little example for parallelization: I did not initialize y and did not assign any values. If you declare y as a C array, you can e. g. assign some values from python objects to fill it with values. Then, you can conduct the last multiplication without GIL, in a parallelized manner, as shown in the code sample.
Regarding computational efficiency: This is probably not the fastest way (which may be writing your code for the CPLEX C Interface entirely maybe), but it does not throw the memory error and does run in an acceptable time if you do not have to repeat this computation too often.

Pypy (python) optimization

I'm looking into replacing some C code with python code and using pypy as the interpreter. The code does a lot of list/dictionary operations. Therefore to get a vague idea of the performance of pypy vs C I am writing sorting algorithms. To test all my read functions I wrote a bubble sort, both in python and C++. CPython of course sucks 6.468s, pypy came in at 0.366s and C++ at 0.229s. Then I remembered that I had forgotten -O3 on the C++ code and the time went to 0.042s. For a 32768 dataset C++ with -O3 is only 2.588s and pypy is 19.65s. Is there anything I can do to speed up my python code (besides using a better sort algorithm of course) or how I use pypy (some flag or something)?
Python code (read_nums module omitted since it's time is trivial: 0.036s on 32768 dataset):
import read_nums
import sys
nums = read_nums.read_nums(sys.argv[1])
done = False
while not done:
done = True
for i in range(len(nums)-1):
if nums[i] > nums[i+1]:
nums[i], nums[i+1] = nums[i+1], nums[i]
done = False
$ time pypy-c2.0 bubble_sort.py test_32768_1.nums
real 0m20.199s
user 0m20.189s
sys 0m0.009s
C code (read_nums function again omitted since it takes little time: 0.017s):
#include <iostream>
#include "read_nums.h"
int main(int argc, char** argv)
{
std::vector<int> nums;
int count, i, tmp;
bool done;
if(argc < 2)
{
std::cout << "Usage: " << argv[0] << " filename" << std::endl;
return 1;
}
count = read_nums(argv[1], nums);
done = false;
while(!done)
{
done = true;
for(i=0; i<count-1; ++i)
{
if(nums[i] > nums[i+1])
{
tmp = nums[i];
nums[i] = nums[i+1];
nums[i+1] = tmp;
done = false;
}
}
}
for(i=0; i<count; ++i)
{
std::cout << nums[i] << ", ";
}
return 0;
}
$ time ./bubble_sort test_32768_1.nums > /dev/null
real 0m2.587s
user 0m2.586s
sys 0m0.001s
P.S. Some of the numbers given in the first paragraph are a little different then the numbers from time because they're the numbers I got the first time.
Further improvements:
Just tried xrange instead of range and the run time went to 16.370s.
Moved the code starting from first done = False to last done = False in a function, speed is now 8.771-8.834s.

The most relevant way to answer this question is to note that the speed of C, CPython and PyPy are not differing by a constant factor: it depends most importantly on what is done and the way it is written. For example, if your C code is doing naive things like walking arrays when the "equivalent" Python code would naturally use dictionaries, then any implementation of Python is faster than C provided the arrays are long enough. Of course, this is not the case on most real-life examples, but the same argument still applies to a smaller extent. There is no one-size-fits-all way to predict the relative speed of a program written in C, or rewritten in Python and running on CPython or PyPy.
Obviously there are guidelines about these relative speeds: on small algorithmical examples you could expect the speed of PyPy to be approaching that of "gcc -O0". In your example it is "only" 1.6x slower. We might help you optimize it, or even find optimizations missing in PyPy, in order to gain 10% or 30% more speed. But this is a tiny example that has nothing to do with your real program. For the reasons above the speed we get here is only vaguely related to the speed you'll get in the end.
I can only say that rewriting code from C to Python for reasons of clarity, notably when the C has become too tangled up for further developments, is clearly a win in the long run --- even in the case where at the end you need to rewrite some parts of it in C again. And PyPy's goal here is to reduce the need for that. While it would be nice to say that no-one ever needs C any more, it's just not true :-)

Why is my python/numpy example faster than pure C implementation?

I have pretty much the same code in python and C. Python example:
import numpy
nbr_values = 8192
n_iter = 100000
a = numpy.ones(nbr_values).astype(numpy.float32)
for i in range(n_iter):
a = numpy.sin(a)
C example:
#include <stdio.h>
#include <math.h>
int main(void)
{
int i, j;
int nbr_values = 8192;
int n_iter = 100000;
double x;
for (j = 0; j < nbr_values; j++){
x = 1;
for (i=0; i<n_iter; i++)
x = sin(x);
}
return 0;
}
Something strange happen when I ran both examples:
$ time python numpy_test.py
real 0m5.967s
user 0m5.932s
sys 0m0.012s
$ g++ sin.c
$ time ./a.out
real 0m13.371s
user 0m13.301s
sys 0m0.008s
It looks like python/numpy is twice faster than C. Is there any mistake in the experiment above? How you can explain it?
P.S. I have Ubuntu 12.04, 8G ram, core i5 btw

First, turn on optimization. Secondly, subtleties matter. Your C code is definitely not 'basically the same'.
Here is equivalent C code:
sinary2.c:
#include <math.h>
#include <stdlib.h>
float *sin_array(const float *input, size_t elements)
{
int i = 0;
float *output = malloc(sizeof(float) * elements);
for (i = 0; i < elements; ++i) {
output[i] = sin(input[i]);
}
return output;
}
sinary.c:
#include <math.h>
#include <stdlib.h>
extern float *sin_array(const float *input, size_t elements)
int main(void)
{
int i;
int nbr_values = 8192;
int n_iter = 100000;
float *x = malloc(sizeof(float) * nbr_values);
for (i = 0; i < nbr_values; ++i) {
x[i] = 1;
}
for (i=0; i<n_iter; i++) {
float *newary = sin_array(x, nbr_values);
free(x);
x = newary;
}
return 0;
}
Results:
$ time python foo.py
real 0m5.986s
user 0m5.783s
sys 0m0.050s
$ gcc -O3 -ffast-math sinary.c sinary2.c -lm
$ time ./a.out
real 0m5.204s
user 0m4.995s
sys 0m0.208s
The reason the program has to be split in two is to fool the optimizer a bit. Otherwise it will realize that the whole loop has no effect at all and optimize it out. Putting things in two files doesn't give the compiler visibility into the possible side-effects of sin_array when it's compiling main and so it has to assume that it actually has some and repeatedly call it.
Your original program is not at all equivalent for several reasons. One is that you have nested loops in the C version and you don't in Python. Another is that you are working with arrays of values in the Python version and not in the C version. Another is that you are creating and discarding arrays in the Python version and not in the C version. And lastly you are using float in the Python version and double in the C version.
Simply calling the sin function the appropriate number of times does not make for an equivalent test.
Also, the optimizer is a really big deal for C. Comparing C code on which the optimizer hasn't been used to anything else when you're wondering about a speed comparison is the wrong thing to do. Of course, you also need to be mindful. The C optimizer is very sophisticated and if you're testing something that really doesn't do anything, the C optimizer might well notice this fact and simply not do anything at all, resulting in a program that's ridiculously fast.

Because "numpy" is a dedicated math library implemented for speed. C has standard functions for sin/cos, that are generally derived for accuracy.
You are also not comparing apples with apples, as you are using double in C, and float32 (float) in python. If we change the python code to calculate float64 instead, the time increases by about 2.5 seconds on my machine, making it roughly match with the correctly optimized C version.
If the whole test was made to do something more complicated that requires more control structres (if/else, do/while, etc), then you would probably see even less difference between C and Python - because the C compiler can't really do "sin" any faster - unless you implement a better "sin" function.
Newer mind the fact that your code isn't quite the same on both sides... ;)

You seem to be doing the the same operation in C 8192 x 10000 times but only 10000 in python (I haven't used numpy before so I may misunderstand the code). Why are you using an array in the python case (again I'm not use to numpy so perhaps the dereferencing is implicit). If you wish to use an array be careful doubles have a performance hit in terms of caching and optimised vectorisation - you're using different types between both implementations (float vs double) but given the algorithm I don't think it matters.
The main reason for a lot of anomalous performance benchmark issues surrounding C vs Pythis, Pythat... Is that simply the C implementation is often poor.
https://www.ibm.com/developerworks/community/blogs/jfp/entry/A_Comparison_Of_C_Julia_Python_Numba_Cython_Scipy_and_BLAS_on_LU_Factorization?lang=en
If you notice the guy writes C to process an array of doubles (without using restrict or const keywords where he could've), he builds with optimisation then forces the compiler to use SIMD rather than AVE. In short the compiler is using an inefficient instruction set for doubles and the wrong type of registers too if he wanted performance - you can be sure the numba and numpy will be using as many bells and whistles as possible and will be shipped with very efficient C and C++ libraries to begin with. In short if you want speed with C you have to think about it, you may even have to disassemble the code and perhaps disable optimisation and use compiler instrinsics instead. It gives you the tools to do it so don't expect the compiler to do all the work for you. If you want that degree of freedom use Cython, Numba, Numpy, Scipy etc. They're very fast but you won't be able to eek out every bit of performance out of the machine - to do that use C, C++ or new versions of FORTRAN.
Here is a very good article on these very points (I'd use SciPy):
https://www.scipy.org/scipylib/faq.html

DLR & Performance

I'm intending to create a web service which performs a large number of manually-specified calculations as fast as possible, and have been exploring the use of DLR.
Sorry if this is long but feel free to skim over and get the general gist.
I've been using the IronPython library as it makes the calculations very easy to specify. My works laptop gives a performance of about 400,000 calculations per second doing the following:
ScriptEngine py = Python.CreateEngine();
ScriptScope pys = py.CreateScope();
ScriptSource src = py.CreateScriptSourceFromString(#"
def result():
res = [None]*1000000
for i in range(0, 1000000):
res[i] = b.GetValue() + 1
return res
result()
");
CompiledCode compiled = src.Compile();
pys.SetVariable("b", new DynamicValue());
long start = DateTime.Now.Ticks;
var res = compiled.Execute(pys);
long end = DateTime.Now.Ticks;
Console.WriteLine("...Finished. Sample data:");
for (int i = 0; i < 10; i++)
{
Console.WriteLine(res[i]);
}
Console.WriteLine("Took " + (end - start) / 10000 + "ms to run 1000000 times.");
Where DynamicValue is a class that returns random numbers from a pre-built array (seeded and built at run time).
When I create a DLR class to do the same thing, I get much higher performance (~10,000,000 calculations per second). The class is as follows:
class DynamicCalc : IDynamicMetaObjectProvider
{
DynamicMetaObject IDynamicMetaObjectProvider.GetMetaObject(Expression parameter)
{
return new DynamicCalcMetaObject(parameter, this);
}
private class DynamicCalcMetaObject : DynamicMetaObject
{
internal DynamicCalcMetaObject(Expression parameter, DynamicCalc value) : base(parameter, BindingRestrictions.Empty, value) { }
public override DynamicMetaObject BindInvokeMember(InvokeMemberBinder binder, DynamicMetaObject[] args)
{
Expression Add = Expression.Convert(Expression.Add(args[0].Expression, args[1].Expression), typeof(System.Object));
DynamicMetaObject methodInfo = new DynamicMetaObject(Expression.Block(Add), BindingRestrictions.GetTypeRestriction(Expression, LimitType));
return methodInfo;
}
}
}
and is called/tested in the same way by doing the following:
dynamic obj = new DynamicCalc();
long t1 = DateTime.Now.Ticks;
for (int i = 0; i < 10000000; i++)
{
results[i] = obj.Add(ar1[i], ar2[i]);
}
long t2 = DateTime.Now.Ticks;
Where ar1 and ar2 are pre-built, runtime seeded arrays of random numbers.
The speed is great this way, but it's not easy to specify the calculation. I'd basically be looking at creating my own lexer & parser, whereas IronPython has everything I need already there.
I'd have thought I could get much better performance from IronPython since it is implemented on top of the DLR, and I could do with better than what I'm getting.
Is my example making best use of the IronPython engine? Is it possible to get significantly better performance out of it?
(Edit) Same as first example but with the loop in C#, setting variables and calling the python function:
ScriptSource src = py.CreateScriptSourceFromString(#"b + 1");
CompiledCode compiled = src.Compile();
double[] res = new double[1000000];
for(int i=0; i<1000000; i++)
{
pys.SetVariable("b", args1[i]);
res[i] = compiled.Execute(pys);
}
where pys is a ScriptScope from py, and args1 is a pre-built array of random doubles. This example executes slower than running the loop in the Python code and passing in the entire arrays.

delnan's comment leads you to some of the problems here. But I'll just get specific about what the differences are here. In the C# version you've cut out a significant amount of the dynamic calls that you have in the Python version. For starters your loop is typed to int and it sounds like ar1 and ar2 are strongly typed arrays. So in the C# version the only dynamic operations you have are the call to obj.Add (which is 1 operation in C#) and potentially the assignment to results if it's not typed to object which seems unlikely. Also note all of this code is lock free.
In the Python version you first have the allocation of the list - this also appears to be during your timer where as in C# it doesn't look like it is. Then you have the dynamic call to range, luckily that only happens once. But that again creates a gigantic list in memory - delnan's suggestion of xrange is an improvement here. Then you have the loop counter i which is getting boxed to an object for every iteration through the loop. Then you have the call to b.GetValue() which is actually 2 dynamic invocatiosn - first a get member to get the "GetValue" method and then an invoke on that bound method object. This is again creating one new object for every iteration of the loop. Then you have the result of b.GetValue() which may be yet another value that's boxed on every iteration. Then you add 1 to that result and you have another boxing operation on every iteration. Finally you store this into your list which is yet another dynamic operation - I think this final operation needs to lock to ensure the list remains consistent (again, delnan's suggestion of using a list comprehension improves this).
So in summary during the loop we have:
C# IronPython
Dynamic Operations 1 4
Allocations 1 4
Locks Acquired 0 1
So basically Python's dynamic behavior does come at a cost vs C#. If you want the best of both worlds you can try and balance what you do in C# vs what you do in Python. For example you could write the loop in C# and have it call a delegate which is a Python function (you can do scope.GetVariable> to get a function out of the scope as a delegate). You could also consider allocating a .NET array for the results if you really need to get every last bit of performance as it may reduce working set and GC copying by not keeping around a bunch of boxed values.
To do the delegate you could have the user write:
def computeValue(value):
return value + 1
Then in the C# code you'd do:
CompiledCode compiled = src.Compile();
compiled.Execute(pys);
var computer = pys.GetVariable<Func<object,object>>("computeValue");
Now you can do:
for (int i = 0; i < 10000000; i++)
{
results[i] = computer(i);
}

If you concerned about computation speed, is it better to look at lowlevel computation specification? Python and C# are high-level languages, and its implementation runtime can spend a lot of time for undercover work.
Look on this LLVM wrapper library: http://www.llvmpy.org
Install it using: pip install llvmpy ply
or on Debian Linux: apt install python-llvmpy python-ply
You still need to write some tiny compiler (you can use PLY library), and bind it with LLVM JIT calls (see LLVM Execution Engine), but this approach can be more effective (generated code much closer to real CPU code), and multiplatform comparing to .NET jail.
LLVM has ready to use optimizing compiler infrastructure, including a lot of optimizer stage modules, and big user and developer community.
Also look here: http://gmarkall.github.io/tutorials/llvm-cauldron-2016
PS: If you interested in it, I can help you with a compiler, contributing to my project's manual in parallel. But it will not be jumpstart, this theme is new to me too.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.