Numpy is a library for efficient numerical arrays.
mpmath, when backed by gmpy, is a library for efficient multiprecision numbers.
How do I put them together efficiently? Or is it already efficient to just use a Numpy array with mpmath numbers?
It doesn't make sense to ask for "as efficient as native floats", but you can ask for it to be close to the efficiency of equivalent C code (or, failing that, Java/C# code). In particular, an efficient array of multi-precision numbers would mean that you can do vectorized operations and not have to look up, say, __add__ a million times in the Python interpreter.
Edit: To the close voter: My question is about an efficient way of putting them together. The answer in the possible duplicate specifically points out that the naive approach is not efficient.
Having a numpy array of dtype=object can be a little misleading, because the powerful numpy machinery that makes operations on the standard dtypes extremely fast now defers to the objects' own Python operators, which means the speed will no longer be there.
Disclaimer: I maintain gmpy2. The following tests were performed with the development version.
a and b are 1000 element lists containing pseudo-random gmpy2.mpfr values with 250 bits of precision. The test performs element-wise multiplication of the two lists.
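The setup code isn't shown above; here is a minimal sketch of how such lists could be built (the seed and value range are illustrative assumptions, not the original benchmark's):
import random
import gmpy2

gmpy2.get_context().precision = 250   # 250-bit mpfr values
random.seed(42)                       # illustrative seed
a = [gmpy2.mpfr(random.random()) for _ in range(1000)]
b = [gmpy2.mpfr(random.random()) for _ in range(1000)]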
The first test uses a list comprehension:
%timeit [x*y for x,y in zip(a,b)]
1000 loops, best of 3: 322 µs per loop
The second test uses the map function to perform the looping:
%timeit list(map(gmpy2.mul, a, b))
1000 loops, best of 3: 299 µs per loop
The third test is a C implementation of the list comprehension:
%timeit vector2(a,b)
1000 loops, best of 3: 243 µs per loop
In the third attempt, vector2 tries to be a well-behaved Python function. Numeric types are handled using gmpy2's type conversion rules, error-checking is done, etc. The context settings are checked, subnormal numbers are created if requested, exceptions are raised if needed, etc. If you ignore all the Python enhancements and assume all the values are already gmpy2.mpfr, I was able to get the time down in the fourth attempt:
%timeit vector2(a,b)
10000 loops, best of 3: 200 µs per loop
The fourth version doesn't do enough error checking to be of general use but a version between the third and fourth attempts may be possible.
It is possible to decrease Python overhead, but as the precision increases, the effective savings decreases.
As far as I'm aware, there is no existing Python library that supports vectorized array operations on multiple precision values. There's unfortunately no particularly efficient way to use multiple precision values within a numpy ndarray, and it's extremely unlikely that there ever will be, since multiple precision values are incompatible with numpy's basic array model.
Each element in a floating point numpy ndarray takes up the same number of bytes, so the array can be represented in terms of the memory address of the first element, the dimensions, and a regular byte offset (or stride) between consecutive array elements.
This scheme has significant performance benefits - adjacent array elements are located at adjacent memory addresses, so sequential reads/writes to an array benefit from better locality of reference. Striding is also very important for usability, since it allows you to do things like operating on views of the same array without creating new copies in memory. When you do x[::2], you are really just doubling the stride over the first axis of the array, such that you address every other element.
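As a small illustration of striding and views (not part of the original answer):
import numpy as np

x = np.arange(10, dtype=np.float64)
print(x.strides)      # (8,)  -- 8 bytes between consecutive elements
view = x[::2]
print(view.strides)   # (16,) -- every-other-element view just doubles the stride
view[0] = 99.0
print(x[0])           # 99.0  -- the view shares memory with x; no copy was made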
By contrast, an array containing multiple precision values would have to contain elements of unequal size, since higher precision values would take up more bytes than low-precision values. A multiple precision array therefore cannot be regularly strided, and loses out on the benefits mentioned above.
In addition to the problems with constructing arrays, even plain arithmetic on multiple precision scalars is likely to be much slower than for floating point scalars. Almost all modern processors have specialized floating point units, whereas multiple precision arithmetic must be implemented in software rather than hardware.
I suspect that these performance issues might be a big part of the reason why there isn't already a Python library that provides the functionality you're looking for.
A current project is qd, which will be able to embed high-precision numbers in Numpy arrays by taking advantage of the fixed memory size of its values. Right now the type is available for Numpy but not yet as a dtype; you can, however, already use it with an object dtype.
(If you want to see what the dtype will look like, you may already uncomment the relevant line to compile it with Numpy support; it should work well enough for a first look, but no functions have been implemented yet. The next release should be in September or October.)
I'm looking for the best 64-bit (or at least 32-bit) hash function for NumPy that has the following properties:
It is vectorized for numpy, meaning that it should have functions for hashing all elements of any N-D numpy array.
It can be applied to any hashable numpy dtype. For this it is enough for the hash to be able to process a raw block of bytes.
It is very fast, like xxhash. It should especially be fast for lots of small inputs, like a huge array of 32- or 64-bit numbers or short np.str_ values, but it should also handle other dtypes.
It should be collision-resistant. I may use just some of the bits, so any subset of bits inside the hash should be collision-resistant too.
It may (or may not) be non-cryptographic, meaning it is alright if it can sometimes be inverted, like xxhash.
It should produce a 64-bit integer or larger output; 32-bit is still OK, although less preferable. It would be good to be able to choose an output size of 32, 64, or 128 bits.
It should itself convert a numpy array to bytes internally so that hashing is fast, or at least there may already be a numpy function that converts a whole N-D array of any popular dtype to a sequence of bytes; it would be good if someone could tell me about that.
I would use xxhash, mentioned in the link above, if it were vectorized for numpy arrays. But right now it is single-object only: its binding functions accept just one block of bytes per call and produce one integer output. Since xxhash uses only a few CPU cycles per call on small (4- or 8-byte) inputs, doing a pure-Python loop over a large array to call xxhash for every number would be very inefficient.
I need it for several things. One is a probabilistic existence filter (set), i.e. a structure that answers, with a given probability (for a given number N of elements), whether a requested element is probably in the set or not. For that I want to use the lower bits of the hash to spread inputs across K buckets, with each bucket additionally storing some (tweakable) number of higher bits to improve the probability of a correct answer. Another application is a Bloom filter. The set needs to be very fast for adding and querying, as compact as possible in memory, and able to handle a very large number of elements.
If there is no good existing solution, then maybe I can improve the xxhash library myself and create a pull request to the author's repository.
I need to deal with lots of data (such as floats) in my program, which costs a lot of memory. The data structures I create to organize the data cost memory, too.
Here is the example:
Heap at the end of the function
Partition of a set of 6954910 objects. Total size = 534417168 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 3446006 50 248112432 46 248112432 46 array.array
1 1722999 25 124055928 23 372168360 70 vertex.Vertex
2 574705 8 82894088 16 455062448 85 list
.......
Any solution?
Python supports array objects that are internally maintained in packed binary arrays of simple data.
For example
import array
a = array.array('f', (0. for x in range(100000)))
will create an array object containing 100,000 floats, and its size will be approximately just 400 KB (4 bytes per element).
Of course you can store only values of the specific type in an array object, not any Python value as you would do with regular list objects.
The numpy module extends over this concept and provides you many ways to quickly manipulate multidimensional data structures of this kind (including viewing part of arrays as arrays sharing the same memory, reshaping arrays, performing math and search operations and much more).
If you need to deal with billions of rows of data per day, by far the simplest way to do that is to create a simple indexer script that splits the billions of rows into small files based on some key (e.g. the first two digits of the IP address in a log file row).
If you need to deal with things like number theory, or log files, or something else where you have a lot of ints or floats:
1) Learn to use Numpy arrays well
2) Start using Numba's just-in-time compiling (a minimal sketch follows after this list)
3) Learn Cython (you can do much more than with Numba)
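For point 2, here is a minimal Numba sketch (assumes numba is installed; the function and data are illustrative):
import numpy as np
from numba import njit

@njit
def total(values):
    # Plain Python loop, compiled to machine code on first call by Numba's JIT.
    s = 0.0
    for v in values:
        s += v
    return s

data = np.random.rand(10_000_000)
print(total(data))   # first call compiles; later calls run at C-like speed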
At least moderate Linux skills are a huge plus in dealing with large sets of data. Some things take seconds to do directly from the command line, while it might not be at all obvious how to do the same thing in Python.
At the very least, use %timeit to test a range of scales leading up to your target scale (e.g. 2.5 billion rows per day). This is an easy way to identify possible performance drops, and to reduce the size of arrays or other factors accordingly.
Learn more about profiling / performance hacking as soon as you're doing something with data.
To make the point about the 'indexer' clear, here is a very simple example indexer I've created and used for doing a lot of computation on files with billions of rows of data, on a $60 per month server.
https://github.com/mikkokotila/indexer
There are times when you have to perform many intermediate operations on one, or more, large Numpy arrays. This can quickly result in MemoryErrors. In my research so far, I have found that pickling (pickle, cPickle, PyTables, etc.) and gc.collect() are ways to mitigate this. I was wondering if there are any other techniques experienced programmers use when dealing with large quantities of data (other than removing redundancies in your strategy/code, of course).
Also, if there's one thing I'm sure of, it's that nothing is free. With some of these techniques, what are the trade-offs (i.e., speed, robustness, etc.)?
I feel your pain... You sometimes end up storing several times the size of your array in values you will later discard. When processing one item in your array at a time, this is irrelevant, but can kill you when vectorizing.
I'll use an example from work for illustration purposes. I recently coded the algorithm described here using numpy. It is a color map algorithm, which takes an RGB image, and converts it into a CMYK image. The process, which is repeated for every pixel, is as follows:
Use the most significant 4 bits of every RGB value, as indices into a three-dimensional look up table. This determines the CMYK values for the 8 vertices of a cube within the LUT.
Use the least significant 4 bits of every RGB value to interpolate within that cube, based on the vertex values from the previous step. The most efficient way of doing this requires computing 16 arrays of uint8s the size of the image being processed. For a 24-bit RGB image, that is equivalent to needing storage of 6x the size of the image in order to process it.
A couple of things you can do to handle this:
1. Divide and conquer
Maybe you cannot process a 1,000x1,000 array in a single pass. But if you can do it with a python for loop iterating over 10 arrays of 100x1,000, it is still going to beat by a very wide margin a python iterator over 1,000,000 items! It's going to be slower, yes, but not by as much.
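A minimal sketch of the divide-and-conquer idea (the computation is a stand-in, not the color-map algorithm above):
import numpy as np

def process_block(block):
    # Placeholder for some memory-hungry vectorized computation.
    return np.sqrt(block) * 2.0

data = np.random.rand(1000, 1000)
out = np.empty_like(data)

block_rows = 100
for start in range(0, data.shape[0], block_rows):
    out[start:start + block_rows] = process_block(data[start:start + block_rows])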
2. Cache expensive computations
This relates directly to my interpolation example above, and is harder to come across, although it is worth keeping an eye open for. Because I am interpolating on a three-dimensional cube with 4 bits in each dimension, there are only 16x16x16 possible outcomes, which can be stored in 16 arrays of 16x16x16 bytes. So I can precompute them and store them using 64KB of memory, and look up the values one by one for the whole image, rather than redoing the same operations for every pixel at huge memory cost. This already pays off for images as small as 64x64 pixels, and basically allows processing images with 6x the number of pixels without having to subdivide the array.
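A sketch of the precomputed look-up idea (the table contents are illustrative, not the actual interpolation weights):
import numpy as np

# One entry per combination of the three 4-bit values: a 16x16x16 table.
lut = np.fromfunction(lambda r, g, b: (r + g + b).astype(np.uint8), (16, 16, 16))

image = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)
low = image & 0x0F                                   # least significant 4 bits per channel
result = lut[low[..., 0], low[..., 1], low[..., 2]]  # one table look-up per pixel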
3. Use your dtypes wisely
If your intermediate values can fit in a single uint8, don't use an array of int32s! This can turn into a nightmare of mysterious errors due to silent overflows, but if you are careful, it can provide a big saving of resources.
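For example (illustrative only), uint8 uses a quarter of the memory of int32, but overflows silently:
import numpy as np

a = np.array([200, 100], dtype=np.uint8)
print(a + a)                                     # [144 200] -- 400 wrapped around, no error raised
print(a.astype(np.int32) + a.astype(np.int32))   # [400 200]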
First most important trick: allocate a few big arrays, and use and recycle portions of them, instead of bringing into life and discarding/garbage collecting lots of temporary arrays. Sounds a little bit old-fashioned, but with careful programming speed-up can be impressive. (You have better control of alignment and data locality, so numeric code can be made more efficient.)
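A minimal sketch of the recycle-your-buffers idea, using ufunc out= arguments (the computation is illustrative):
import numpy as np

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)
scratch = np.empty_like(a)           # allocated once, reused on every pass

for _ in range(100):
    np.multiply(a, b, out=scratch)   # writes into scratch; no new temporary
    np.add(scratch, a, out=scratch)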
Second: use numpy.memmap and hope that the OS's caching of disk accesses is efficient enough.
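A minimal numpy.memmap sketch (the filename and sizes are illustrative):
import numpy as np

# A 1 GB float64 array backed by a file on disk instead of RAM.
big = np.memmap('scratch.dat', dtype=np.float64, mode='w+', shape=(125_000_000,))
big[:1000] = np.arange(1000)   # only the touched pages are pulled into memory
big.flush()                    # write dirty pages back to disk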
Third: as pointed out by @Jaime, work on block sub-matrices if the whole matrix is too big.
EDIT:
Avoid unnecessary list comprehensions, as pointed out in this answer on SE.
The dask.array library provides a numpy interface that uses blocked algorithms to handle larger-than-memory arrays with multiple cores.
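A minimal dask.array sketch (assumes dask is installed; the shapes and chunk sizes are illustrative):
import dask.array as da

x = da.random.random((100_000, 100_000), chunks=(10_000, 10_000))  # lazy, blocked array
m = x.mean(axis=0)       # builds a task graph; nothing is computed yet
print(m[:5].compute())   # evaluates only the chunks needed for these values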
You could also look into Spartan, Distarray, and Biggus.
If it is possible for you, use numexpr. For numeric calculations like a**2 + b**2 + 2*a*b (for a and b being arrays) it
will compile machine code that will execute fast and with minimal memory overhead, taking care of memory locality stuff (and thus cache optimization) if the same array occurs several times in your expression,
uses all cores of your dual or quad core CPU,
is an extension to numpy, not an alternative.
For medium and large sized arrays, it is faster than numpy alone.
Take a look at the web page given above, there are examples that will help you understand if numexpr is for you.
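A minimal numexpr sketch (assumes numexpr is installed; the arrays are illustrative):
import numpy as np
import numexpr as ne

a = np.random.rand(10_000_000)
b = np.random.rand(10_000_000)

# Evaluated in one multithreaded pass, without the separate temporaries that
# plain numpy would allocate for a**2, b**2 and 2*a*b.
result = ne.evaluate("a**2 + b**2 + 2*a*b")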
On top of everything said in the other answers: numpy's ufuncs let us aggregate without keeping intermediate results in memory (reduce), and, when we do want all the intermediate results of the computation, we can use accumulate:
Aggregates
For binary ufuncs, there are some interesting aggregates that can be computed directly from the object. For example, if we'd like to reduce an array with a particular operation, we can use the reduce method of any ufunc. A reduce repeatedly applies a given operation to the elements of an array until only a single result remains.
For example, calling reduce on the add ufunc returns the sum of all elements in the array:
x = np.arange(1, 6)
np.add.reduce(x) # Outputs 15
Similarly, calling reduce on the multiply ufunc results in the product of all array elements:
np.multiply.reduce(x) # Outputs 120
Accumulate
If we'd like to store all the intermediate results of the computation, we can instead use accumulate:
np.add.accumulate(x) # Outputs array([ 1, 3, 6, 10, 15], dtype=int32)
np.multiply.accumulate(x) # Outputs array([ 1, 2, 6, 24, 120], dtype=int32)
Using these numpy operations wisely while performing many intermediate operations on one, or more, large Numpy arrays can give you great results without the use of any additional libraries.
I tried numpy.zeros((100000, 100000)) and it returned "array is too big".
Response to comments:
1) I could create a 10k x 10k matrix, but not 100k x 100k or 1 mil x 1 mil.
2) The matrix is not sparse.
We can do simple maths to find out. A 1 million by 1 million matrix has 1,000,000,000,000 elements. If each element takes up 4 bytes, it would require 4,000,000,000,000 bytes of memory. That is, 3.64 terabytes.
There is also a chance that a given implementation of Python uses more than that for a single number. For instance, just the leap from a float to a double means you'll need 7.28 terabytes instead. (There is also a chance that Python stores the number on the heap and all you get is a pointer to it, approximately doubling the footprint, without even taking metadata into account; but that's slippery ground, I'm always wrong when I talk about Python internals, so let's not dig into it too much.)
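The back-of-the-envelope arithmetic, spelled out:
elements = 1_000_000 * 1_000_000   # 1e12 elements
print(elements * 4 / 2**40)        # ~3.64 TiB for 4-byte floats
print(elements * 8 / 2**40)        # ~7.28 TiB for 8-byte doubles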
I suppose numpy doesn't have a hardcoded limit, but if your system doesn't have that much free memory, there isn't really anything to do.
Does your matrix have a lot of zero entries? I suspect it does, few people do dense problems that large.
You can easily do that with a sparse matrix. SciPy has a good set built in. http://docs.scipy.org/doc/scipy/reference/sparse.html
The space required by a sparse matrix grows with the number of nonzero elements, not the dimensions.
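A minimal scipy.sparse sketch (the shape and values are illustrative):
import numpy as np
from scipy import sparse

# A 100,000 x 100,000 matrix with only three nonzero entries.
rows = np.array([0, 50_000, 99_999])
cols = np.array([1, 2, 3])
vals = np.array([1.0, 2.0, 3.0])
m = sparse.coo_matrix((vals, (rows, cols)), shape=(100_000, 100_000))
print(m.nnz)                      # 3 stored values, regardless of the dimensions
y = m.tocsr() @ np.ones(100_000)  # sparse matrix-vector product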
Your system probably won't have enough memory to hold the matrix, but nowadays you might well have a few terabytes of free disk space. In that case, numpy.memmap would allow you to have the array stored on disk, but appear as if it resides in memory.
However, it's probably best to rethink the problem. Do you really need a matrix this large? Any computations involving it will probably be infeasibly slow, and need to be done blockwise.
Given the thread here
It seems that numpy is not the most ideal for ultra fast calculation. Does anyone know what overhead we must be aware of when using numpy for numerical calculation?
Well, depends on what you want to do. XOR is, for instance, hardly relevant for someone interested in doing numerical linear algebra (for which numpy is pretty fast, by virtue of using optimized BLAS/LAPACK libraries underneath).
Generally, the big idea behind getting good performance from numpy is to amortize the cost of the interpreter over many elements at a time. In other words, move the loops from python code (slow) into C/Fortran loops somewhere in the numpy/BLAS/LAPACK/etc. internals (fast). If you succeed in that operation (called vectorization) performance will usually be quite good.
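An illustrative contrast (not from the linked thread) between paying the interpreter cost per element and paying it once:
import numpy as np

a = np.random.rand(100_000)
b = np.random.rand(100_000)

slow = np.array([x + y for x, y in zip(a, b)])   # Python-level loop: interpreter cost per element
fast = a + b                                     # one Python call; the loop runs in compiled numpy code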
Of course, you can get even better performance by dumping the python interpreter and using, say, C++ instead. Whether this approach actually succeeds or not depends on how good you are at high performance programming with C++ vs. numpy, and what operation exactly you're trying to do.
Any time you have an expression like x = a * b + c / d + e, you end up with one temporary array for a * b, one temporary array for c / d, one for one of the sums and finally one allocation for the result. This is a limitation of Python types and operator overloading. You can however do things in-place explicitly using the augmented assignment (*=, +=, etc.) operators and be assured that copies aren't made.
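A sketch of the in-place version of that expression (the arrays are illustrative):
import numpy as np

a, b, c, d, e = (np.random.rand(1_000_000) for _ in range(5))

x = a * b    # one allocation for the result
x += c / d   # c / d still makes one temporary, but the sum is written in place
x += e       # no further temporaries for the additions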
As for the specific reason NumPy performs more slowly in that benchmark, it's hard to tell but it probably has to do with the constant overhead of checking sizes, type-marshaling, etc. that Cython/etc. don't have to worry about. On larger problems you'd probably see it get closer.
I can't really tell, but I'd guess there are two factors:
Perhaps numpy is copying more stuff? weave is often faster when you avoid allocating big temporary arrays, but this shouldn't matter here.
numpy has a bit of overhead used in iterating over (possibly) multidimensional arrays. This overhead would normally be dwarfed by number crunching, but an xor is really really fast, so all that really matters is the overhead.
Your sub-question: for a = sin(x), how many round trips are there?
The trick is to pass a numpy array to sin(x); then there is only one 'round trip' for the whole array, since numpy will return an array of sine values. There is no Python for loop involved in this operation.
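For illustration, one call evaluates the sine of every element with no Python loop:
import numpy as np

x = np.linspace(0, 2 * np.pi, 1_000_000)
a = np.sin(x)   # a single call into compiled code returns the whole array of sine values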