I was using boost and QuantLib to produce a 'Array' containing random numbers (standard normal distributed). However, I noticed that the computation performance was not very desirable, and the speed was much slower than simply using numpy of python. Can anyone give me some suggestions?
Many thanks.
Here is my c++ code:
using namespace QuantLib;
Array generateRandNumbers(unsigned long seed, Size n) {
Array res(n);
boost::mt19937 rnd(seed);
boost::normal_distribution<> normDist(0, 1);
boost::variate_generator<boost::mt19937&, boost::normal_distribution<>> generator_norm(rnd, normDist);
BOOST_FOREACH(Real& x, res) x = generator_norm();
return res;
}
int main()
{
unsigned long seed = 1;
Size n = 1e6;
boost::timer timer;
Array randNumbers = generateRandNumbers(seed, n);
std::cout << timer.elapsed() << std::endl;
return 0
}
And this is my python code:
import numpy as np
import time
ts = time.time()
res = np.random.normal(0, 1, 1000000);
print(time.time() - ts)
You asked:
Can anyone give me some suggestions?
The programing language isn't likely to be play a major factor here. Therefore this questions comes down to other parameters:
Random number algorithm used (eg. Mersenne Twister)
Implementation details of algorithm that may affect perfomance
How the code is compiled, linked and run. (release, debugging session, compiler optimizations)
As for the last point - make sure that when making comparison you always use an optimized release build.
Related
I've started messing around with parallel programming and cython/openmp, and I have a simple program that sums over an array using prange:
import numpy as np
from cython.parallel import prange
from cython import boundscheck, wraparound
#boundscheck(False)
#wraparound(False)
def parallel_summation(double[:] vec):
cdef int n = vec.shape[0]
cdef double total
cdef int i
for i in prange(n, nogil=True):
total += vec[i]
return total
It seems to work OK with a setup.py file. However, I was wondering if it is possible to adjust this function and have a little more control over what the processors are doing.
Let's say I have 4 processors: I want to split the vector to be summed into 4 parts, and then have each processor locally add the elements inside. Then at the end, I can combine the results from each processor to get the total sum. From the cython documentation, I wasn't able to gather whether something like this is possible or not (the documentation is a little sparse).
I'd appreciate if someone could explain if/how something like this is done using cython/openmp, or maybe help locate some relevant examples (its been surprisingly hard to find simple ones online).
I want to split the vector to be summed into 4 parts, and then have each processor locally add the elements inside. Then at the end, I can combine the results from each processor to get the total sum.
That's exactly what's happening here already. Cython infers from your inplace operation that you want to do a reduction. OpenMP will implement a parallel loop with private (zero initialized) copies of the total variable and add them all to total at the end of the loop.
In the generated C, this looks like this:
#pragma omp parallel
{
#pragma omp for firstprivate(__pyx_v_i) lastprivate(__pyx_v_i) reduction(+:__pyx_v_total)
for (__pyx_t_2 = 0; __pyx_t_2 < __pyx_t_3; __pyx_t_2++){
{
__pyx_v_i = (int)(0 + 1 * __pyx_t_2);
__pyx_t_4 = __pyx_v_i;
__pyx_v_total = (__pyx_v_total + (*((double *) ( /* dim=0 */ (__pyx_v_vec.data + __pyx_t_4 * __pyx_v_vec.strides[0]) ))));
}
}
}
You just need to enable OpenMP as described here.
The one thing that you should change in your code, is to initialize total = 0, otherwise it's just an unitialized C variable with may contain garbage.
I have noticed that you can put various numbers inside of numpy.random.seed(), for example numpy.random.seed(1), numpy.random.seed(101). What do the different numbers mean? How do you choose the numbers?
Consider a very basic random number generator:
Z[i] = (a*Z[i-1] + c) % m
Here, Z[i] is the ith random number, a is the multiplier and c is the increment - for different a, c and m combinations you have different generators. This is known as the linear congruential generator introduced by Lehmer. The remainder of that division, or modulus (%), will generate a number between zero and m-1 and by setting U[i] = Z[i] / m you get random numbers between zero and one.
As you may have noticed, in order to start this generative process - in order to have a Z[1] you need to have a Z[0] - an initial value. This initial value that starts the process is called the seed. Take a look at this example:
The initial value, the seed is determined as 7 to start the process. However, that value is not used to generate a random number. Instead, it is used to generate the first Z.
The most important feature of a pseudo-random number generator would be its unpredictability. Generally, as long as you don't share your seed, you are fine with all seeds as the generators today are much more complex than this. However, as a further step you can generate the seed randomly as well. You can skip the first n numbers as another alternative.
Main source: Law, A. M. (2007). Simulation modeling and analysis. Tata McGraw-Hill.
The short answer:
There are three ways to seed() a random number generator in numpy.random:
use no argument or use None -- the RNG initializes itself from the OS's random number generator (which generally is cryptographically random)
use some 32-bit integer N -- the RNG will use this to initialize its state based on a deterministic function (same seed → same state)
use an array-like sequence of 32-bit integers n0, n1, n2, etc. -- again, the RNG will use this to initialize its state based on a deterministic function (same values for seed → same state). This is intended to be done with a hash function of sorts, although there are magic numbers in the source code and it's not clear why they are doing what they're doing.
If you want to do something repeatable and simple, use a single integer.
If you want to do something repeatable but unlikely for a third party to guess, use a tuple or a list or a numpy array containing some sequence of 32-bit integers. You could, for example, use numpy.random with a seed of None to generate a bunch of 32-bit integers (say, 32 of them, which would generate a total of 1024 bits) from the OS's RNG, store in some seed S which you save in some secret place, then use that seed to generate whatever sequence R of pseudorandom numbers you wish. Then you can later recreate that sequence by re-seeding with S again, and as long as you keep the value of S secret (as well as the generated numbers R), no one would be able to reproduce that sequence R. If you just use a single integer, there's only 4 billion possibilities and someone could potentially try them all. That may be a bit on the paranoid side, but you could do it.
Longer answer
The numpy.random module uses the Mersenne Twister algorithm, which you can confirm yourself in one of two ways:
Either by looking at the documentation for numpy.random.RandomState, of which numpy.random uses an instance for the numpy.random.* functions (but you can also use an isolated independent instance of)
Looking at the source code in mtrand.pyx which uses something called Pyrex to wrap a fast C implementation, and randomkit.c and initarray.c.
In any case here's what the numpy.random.RandomState documentation says about seed():
Compatibility Guarantee A fixed seed and a fixed series of calls to RandomState methods using the same parameters will always produce the same results up to roundoff error except when the values were incorrect. Incorrect values will be fixed and the NumPy version in which the fix was made will be noted in the relevant docstring. Extension of existing parameter ranges and the addition of new parameters is allowed as long the previous behavior remains unchanged.
Parameters:
seed : {None, int, array_like}, optional
Random seed used to initialize the pseudo-random number generator. Can be any integer between 0 and 2**32 - 1 inclusive, an array (or other sequence) of such integers, or None (the default). If seed is None, then RandomState will try to read data from /dev/urandom (or the Windows analogue) if available or seed from the clock otherwise.
It doesn't say how the seed is used, but if you dig into the source code it refers to the init_by_array function: (docstring elided)
def seed(self, seed=None):
cdef rk_error errcode
cdef ndarray obj "arrayObject_obj"
try:
if seed is None:
with self.lock:
errcode = rk_randomseed(self.internal_state)
else:
idx = operator.index(seed)
if idx > int(2**32 - 1) or idx < 0:
raise ValueError("Seed must be between 0 and 2**32 - 1")
with self.lock:
rk_seed(idx, self.internal_state)
except TypeError:
obj = np.asarray(seed).astype(np.int64, casting='safe')
if ((obj > int(2**32 - 1)) | (obj < 0)).any():
raise ValueError("Seed must be between 0 and 2**32 - 1")
obj = obj.astype('L', casting='unsafe')
with self.lock:
init_by_array(self.internal_state, <unsigned long *>PyArray_DATA(obj),
PyArray_DIM(obj, 0))
And here's what the init_by_array function looks like:
extern void
init_by_array(rk_state *self, unsigned long init_key[], npy_intp key_length)
{
/* was signed in the original code. RDH 12/16/2002 */
npy_intp i = 1;
npy_intp j = 0;
unsigned long *mt = self->key;
npy_intp k;
init_genrand(self, 19650218UL);
k = (RK_STATE_LEN > key_length ? RK_STATE_LEN : key_length);
for (; k; k--) {
/* non linear */
mt[i] = (mt[i] ^ ((mt[i - 1] ^ (mt[i - 1] >> 30)) * 1664525UL))
+ init_key[j] + j;
/* for > 32 bit machines */
mt[i] &= 0xffffffffUL;
i++;
j++;
if (i >= RK_STATE_LEN) {
mt[0] = mt[RK_STATE_LEN - 1];
i = 1;
}
if (j >= key_length) {
j = 0;
}
}
for (k = RK_STATE_LEN - 1; k; k--) {
mt[i] = (mt[i] ^ ((mt[i-1] ^ (mt[i-1] >> 30)) * 1566083941UL))
- i; /* non linear */
mt[i] &= 0xffffffffUL; /* for WORDSIZE > 32 machines */
i++;
if (i >= RK_STATE_LEN) {
mt[0] = mt[RK_STATE_LEN - 1];
i = 1;
}
}
mt[0] = 0x80000000UL; /* MSB is 1; assuring non-zero initial array */
self->gauss = 0;
self->has_gauss = 0;
self->has_binomial = 0;
}
This essentially "munges" the random number state in a nonlinear, hash-like method using each value within the provided sequence of seed values.
What is normally called a random number sequence in reality is a "pseudo-random" number sequence because the values are computed using a deterministic algorithm and probability plays no real role.
The "seed" is a starting point for the sequence and the guarantee is that if you start from the same seed you will get the same sequence of numbers. This is very useful for example for debugging (when you are looking for an error in a program you need to be able to reproduce the problem and study it, a non-deterministic program would be much harder to debug because every run would be different).
Basically the number guarantees the same 'randomness' every time.
More properly, the number is a seed, which can be an integer, an array (or other sequence) of integers of any length, or the default (none). If seed is none, then random will try to read data from /dev/urandom if available or make a seed from the clock otherwise.
Edit: In most honesty, as long as your program isn't something that needs to be super secure, it shouldn't matter what you pick. If this is the case, don't use these methods - use os.urandom() or SystemRandom if you require a cryptographically secure pseudo-random number generator.
The most important concept to understand here is that of pseudo-randomness. Once you understand this idea, you can determine if your program really needs a seed etc. I'd recommend reading here.
To understand the meaning of random seeds, you need to first understand the "pseudo-random" number sequence because the values are computed using a deterministic algorithm.
So you can think of this number as a starting value to calulate the next number you get from the random generator. Putting the same value here will make your program getting the same "random" value everytime, so your program becomes deterministic.
As said in this post
they (numpy.random and random.random) both use the Mersenne twister sequence to generate their random numbers, and they're both completely deterministic - that is, if you know a few key bits of information, it's possible to predict with absolute certainty what number will come next.
If you really care about randomness, ask the user to generate some noise (some arbitary words) or just put the system time as seed.
If your codes run on Intel CPU (or AMD with newest chips) I also suggest you to check the RdRand package which uses the cpu instruction rdrand to collect "true" (hardware) randomness.
Refs:
Random seed
What is a seed in terms of generating a random number
One very specific answer: np.random.seed can take values from 0 and 2**32 - 1, which interestingly differs from random.seed which can take any hashable object.
A side comment: better set your seed to a rather large number but still within the generator limit. Doing so can let the seed number have a good balance of 0 and 1 bits. Avoid having many 0 bits in the seed.
Reference: pyTorch documentation
I have this program in Python:
# ...
print 2 ** (int(input())-1) % 1000000007
The problem is that this program works a long time on big numbers. I rewrote my code using C++, but sometimes I have a wrong answer. For example, in Python code for number 12345678 I've got 749037894 and its correct, but in C++ I've got -291172004.
This is the C++ code:
#include <iostream>
#include <cmath>
using namespace std;
const int MOD = 1e9 + 7;
int main() {
// ...
long long x;
cin >> x;
long long a =pow(2, (x-1));
cout << a % MOD;
}
As already mentioned, your problem is that for large exponent you have integer overflow.
To overcome this, remember that modular multiplication has such property that:
(A * B) mod C = (A mod C * B mod C) mod C
And then you can implement 'e to the power p modulo m' function using fast exponentiation scheme.
Assuming no negative powers:
long long powmod(long long e, long long p, long long m){
if (p == 0){
return 1;
}
long long a = 1;
while (p > 1){
if (p % 2 == 0){
e = (e * e) % m;
p /= 2;
} else{
a = (a * e) % m;
e = (e * e) % m;
p = (p - 1) / 2;
}
}
return (a * e) % m;
}
Note that remainder is taken after every multiplication, so no overflow can occur, if single multiplication doesn't overflow (and that's true for 1000000007 as m and long long).
You seem to be dealing with positive numbers and those are overflowing the number of bits you've allocated for their storage. Also keep in mind that there is a difference between Python and C/C++ in the way a modulo on a negative value is computed. To get a similar computation, you will need to add the Modulo to the value so it's positive before you take the modulo which is the way it works in Python:
cout << (a+MOD) % MOD;
You may have to add MOD n times till the temporary value is positive before taking its modulo.
Like has been mentioned by many of the other answers, your problem lies in integer overflow.
You can do like suggested by deniss and implement your own modmul() and modpow() functions.
If, however, this is part of a project that will need to do plenty of calculations with very large numbers, I would suggest using a "big number library" like GNU GMP or mbedTLS Bignum library.
In C++ the various fundamental types have fixed sizes. For example a long long is typically 64 bits wide. But width varies with system type and other factors. As suggested above you can check climits.h for your particular environment's limits.
Raising 2 to the power 12345677 would involve shifting the binary number 10 left by 12345676 bits which wouldn't fit in a 64 bit long long (and I suspect is unlikely to fit in most long long implementations).
Another factor to consider is that pow returns a double (or long double) depending on the overload used. You don't say what compiler you are using but most likely you got a warning about possible truncation or data loss when the result of calling pow is assigned to the long long variable a.
Finally, even if pow is returning a long double I suspect the exponent 12345677 is too large to be stored in a long double so pow is probably returning positive infinity which then gets truncated to some bit pattern that will fit in a long long. You could certainly check that by introducing an intermediate long double variable to receive the value of pow which you could then examine in a debugger.
I am currently using the book 'Programming in D' for learning D. I tried to solve a problem of summing up the squares of numbers from 1 to 10000000. I first made a functional approach to solve the problem with the map and reduce but as the numbers get bigger I have to cast the numbers to bigint to get the correct output.
long num = 10000001;
BigInt result;
result = iota(1,num).map!(a => to!BigInt(a * a)).reduce!((a,b) => (a + b));
writeln("The sum is : ", result);
The above takes 7s to finish when compiled with dmd -O . I profiled the program and most of the time is wasted on BigInt calls. Though the square of the number can fit into a long I have to typecast them to bigint so that reduce function sums and returns the appropriate sum. The python program takes only 3 seconds to finish it. When num = 100000000 D program gets to 1 minute and 13 seconds to finish. Is there a way to optimize the calls to bigint. The products can themselves be long but they have to be typecasted as bigint objects so that they give right results from reduce operations. I tried pushing the square of the numbers into a bigint array but its also slower. I tried to typecast all the numbers as Bigint
auto bigs_map_nums = iota(1,num).map!(a => to!BigInt(a)).array;
auto bigs_map = sum(bigs_map_nums.map!(a => (a * a)).array);
But its also slower. I read the answers at How to optimize this short factorial function in scala? (Creating 50000 BigInts). Is it a problem with the implementation of multiplication for bigger integers in D too ? Is there a way to optimize the function calls to BigInt ?
python code :
timeit.timeit('print sum(map(lambda num : num * num, range(1,10000000)))',number=1)
333333283333335000000
3.58552622795105
The code was executed on a dual-core 64 bit linux laptop with 2 GB RAM.
python : 2.7.4
dmd : DMD64 D Compiler v2.066.1
Without range coolness: foreach(x; 0 .. num) result += x * x;
With range cool(?)ness:
import std.functional: reverseArgs;
result = iota(1, num)
.map!(a => a * a)
.reverseArgs!(reduce!((a, b) => a + b))(BigInt(0) /* seed */);
The key is to avoid BigInting every element, of course.
The range version is a little slower than the non-range one. Both are significantly faster than the python version.
Edit: Oh! Oh! It can be made much more pleasant with std.algorithm.sum:
result = iota(1, num)
.map!(a => a * a)
.sum(BigInt(0));
The python code is not equivalent to the D code, in fact it does a lot less.
Python uses an int, then it promotes that int to long() when the result is bigger than what can be stored in an int() type. Internally, (at least CPython) uses a long number to store integer bigger than 256, which is at least 32bits. Up until that overflow normal cpu instructions can be used for the multiplication which are quite faster than bigint multiplication.
D's BigInt implementation treats the numbers as BigInt from the start and uses the expensive multiplication operation from 1 until the end. Much more work to be done there.
It's interesting how complicated the multiplication can be when we talk about BigInts.
The D implementation is
https://github.com/D-Programming-Language/phobos/blob/v2.066.1/std/internal/math/biguintcore.d#L1246
Python starts by doing
static PyObject *
int_mul(PyObject *v, PyObject *w)
{
long a, b;
long longprod; /* a*b in native long arithmetic */
double doubled_longprod; /* (double)longprod */
double doubleprod; /* (double)a * (double)b */
CONVERT_TO_LONG(v, a);
CONVERT_TO_LONG(w, b);
/* casts in the next line avoid undefined behaviour on overflow */
longprod = (long)((unsigned long)a * b);
... //check if we have overflowed
{
const double diff = doubled_longprod - doubleprod;
const double absdiff = diff >= 0.0 ? diff : -diff;
const double absprod = doubleprod >= 0.0 ? doubleprod :
-doubleprod;
/* absdiff/absprod <= 1/32 iff
32 * absdiff <= absprod -- 5 good bits is "close enough" */
if (32.0 * absdiff <= absprod)
return PyInt_FromLong(longprod);
else
return PyLong_Type.tp_as_number->nb_multiply(v, w);
}
}
and if the number is bigger than what a long can hold it does a karatsuba multiplication. Implementation in :
http://svn.python.org/projects/python/trunk/Objects/longobject.c (k_mul function)
The equivalent code would wait to use BigInts until they are no native data types that can hold the number in question.
DMD's backend does not emit highly optimized code. For fast programs, compile with GDC or LDC.
On my computer, I get these timings:
Python: 3.01
dmd -O -inline -release: 3.92
ldmd2 -O -inline -release: 2.14
I have been playing around with writing cffi modules in python, and their speed is making me wonder if I'm using standard python correctly. It's making me want to switch to C completely! Truthfully there are some great python libraries I could never reimplement myself in C so this is more hypothetical than anything really.
This example shows the sum function in python being used with a numpy array, and how slow it is in comparison with a c function. Is there a quicker pythonic way of computing the sum of a numpy array?
def cast_matrix(matrix, ffi):
ap = ffi.new("double* [%d]" % (matrix.shape[0]))
ptr = ffi.cast("double *", matrix.ctypes.data)
for i in range(matrix.shape[0]):
ap[i] = ptr + i*matrix.shape[1]
return ap
ffi = FFI()
ffi.cdef("""
double sum(double**, int, int);
""")
C = ffi.verify("""
double sum(double** matrix,int x, int y){
int i, j;
double sum = 0.0;
for (i=0; i<x; i++){
for (j=0; j<y; j++){
sum = sum + matrix[i][j];
}
}
return(sum);
}
""")
m = np.ones(shape=(10,10))
print 'numpy says', m.sum()
m_p = cast_matrix(m, ffi)
sm = C.sum(m_p, m.shape[0], m.shape[1])
print 'cffi says', sm
just to show the function works:
numpy says 100.0
cffi says 100.0
now if I time this simple function I find that numpy is really slow!
Am I using numpy in the correct way? Is there a faster way to calculate the sum in python?
import time
n = 1000000
t0 = time.time()
for i in range(n): C.sum(m_p, m.shape[0], m.shape[1])
t1 = time.time()
print 'cffi', t1-t0
t0 = time.time()
for i in range(n): m.sum()
t1 = time.time()
print 'numpy', t1-t0
times:
cffi 0.818415880203
numpy 5.61657714844
Numpy is slower than C for two reasons: the Python overhead (probably similar to cffi) and generality. Numpy is designed to deal with arrays of arbitrary dimensions, in a bunch of different data types. Your example with cffi was made for a 2D array of floats. The cost was writing several lines of code vs .sum(), 6 characters to save less than 5 microseconds. (But of course, you already knew this). I just want to emphasize that CPU time is cheap, much cheaper than developer time.
Now, if you want to stick to Numpy, and you want to get a better performance, your best option is to use Bottleneck. They provide a few functions optimised for 1 and 2D arrays of float and doubles, and they are blazing fast. In your case, 16 times faster, which will put execution time in 0.35, or about twice as fast as cffi.
For other functions that bottleneck does not have, you can use Cython. It helps you write C code with a more pythonic syntax. Or, if you will, convert progressively Python into C until you are happy with the speed.