I'm looking for the best 64-bit (or at least 32-bit) hash function for NumPy that has the following properties:
It is vectorized for NumPy, meaning it has functions for hashing all elements of any N-D numpy array in one call.
It can be applied to any hashable numpy dtype. For this it is enough for the hash to be able to process a raw block of bytes.
It is very fast, like xxhash. In particular it should be fast for many small inputs, like a huge array of 32- or 64-bit numbers or short np.str_ values, but it should also handle other dtypes.
It should be collision-resistant. I may use only some of the bits, so any subset of the bits should be collision-resistant too.
It may or may not be cryptographic; it is alright if it can sometimes be inverted, like xxhash.
It should produce a 64-bit integer or larger output; 32-bit output is still acceptable, although less preferable. It would be good if the output size could be chosen among 32, 64, and 128 bits.
It should itself convert the numpy array to bytes internally so that hashing is fast, or at least there should already be a numpy function that converts a whole N-D array of any popular dtype to a sequence of bytes; I'd appreciate a pointer if such a function exists.
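For the last point, I know numpy can already give me the raw bytes of an array; something like this sketch is what I have in mind (assuming C-contiguous data):
import numpy as np
a = np.arange(12, dtype=np.int64).reshape(3, 4)
raw = a.tobytes()                                               # one contiguous bytes object (a copy), C order
byte_view = np.ascontiguousarray(a).reshape(-1).view(np.uint8)  # zero-copy uint8 view, shape (96,)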
I would use xxhash (linked above) if it had numpy array vectorization. But right now it is single-object only: its binding functions accept just one block of bytes per call and produce one integer output. And since xxhash takes only a few CPU cycles per call on small (4- or 8-byte) inputs, doing a pure-Python loop over a large array to call xxhash for every number would be very inefficient.
I need it for several things. One is probabilistic existence filters (or sets): I need to design a structure (set) that, for a given number N of elements, answers with a given probability whether a requested element is probably in the set or not. For that I want to use the lower bits of the hash to spread inputs across K buckets, and each bucket additionally stores some (tweakable) number of higher bits to increase the probability of a correct answer. Another application is a Bloom filter. This set needs to be very fast for adding and querying, as compact as possible in memory, and able to handle a very large number of elements.
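To illustrate the bucket idea, here is a rough sketch with a splitmix64-style finalizer written in numpy; it is only a placeholder for the kind of vectorized hash I'm asking about, not a claim about its speed or quality:
import numpy as np

def mix64(x):
    # Elementwise splitmix64-style mixing of a uint64 array (placeholder hash).
    x = np.asarray(x, dtype=np.uint64).copy()
    x += np.uint64(0x9E3779B97F4A7C15)
    x ^= x >> np.uint64(30)
    x *= np.uint64(0xBF58476D1CE4E5B9)
    x ^= x >> np.uint64(27)
    x *= np.uint64(0x94D049BB133111EB)
    x ^= x >> np.uint64(31)
    return x

keys = np.random.randint(0, 2**63, size=1_000_000, dtype=np.uint64)
h = mix64(keys)
K = 1 << 20                                            # number of buckets (tweakable)
bucket = h & np.uint64(K - 1)                          # lower bits pick the bucket
fingerprint = (h >> np.uint64(48)).astype(np.uint16)   # some higher bits stored per bucket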
If there is no existing good solution, then maybe I can also improve the xxhash library and create a pull request to the author's repository.
Large numpy array (over 4GB) with npy file and memmap function
I was using the numpy package for array calculations when I read https://docs.scipy.org/doc/numpy/neps/npy-format.html
In "Format Specification: Version 2.0" it says that, for the .npy file format, "version 2.0 format extends the header size to 4 GiB".
My questions are:
What is the header size? Does that mean I can only save a numpy.array of at most 4GB into an npy file? How large can a single array be?
I also read https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.memmap.html
where it states that "Memory-mapped files cannot be larger than 2GB on 32-bit systems".
Does that mean numpy.memmap's limitation depends on the memory of the system? Is there any way to avoid that limitation?
Further, I read that we can choose the dtype of the array, where the best resolution is "complex128". Is there any way to "use" and "save" elements with more accuracy on a 64-bit computer (more accurate than complex128 or float64)?
The version 1.0 header size field was 16 bits wide, allowing headers smaller than 64KiB. Because the header describes the structure of the data and doesn't contain the data itself, this is not a huge concern for most people. Quoting the notes, "This can be exceeded by structured arrays with a large number of columns." So, to answer the first question: the header size limit was 64KiB (4GiB in version 2.0), but the data comes after the header, so this is not a limit on the array size. The format doesn't specify a data size limit.
Memory map capacity is dependent on operating system as well as machine architecture. Nowadays we've largely moved to flat but typically virtual address maps, so the program itself, stack, heap, and mapped files all compete for the same space, in total 4GiB for 32 bit pointers. Operating systems frequently partition this in quite large chunks, so some systems might only allow 2GiB total for user space, others 3GiB; and often you can map more memory than you can allocate otherwise. The memmap limitation is more closely tied to the operating system in use than the physical memory.
Non-flat address spaces, such as using distinct segments on OS/2, could allow larger usage. The cost is that a pointer is no longer a single word. PAE, for instance, supplies a way for the operating system to use more memory but still leaves processes with their own 32 bit limits. Typically it's easier nowadays to use a 64 bit system, allowing memory spaces up to 16 exabytes. Because data sizes have grown a lot, we also handle it in larger pieces, such as 4MiB or 16MiB allocations rather than the classic 4KiB pages or 512B sectors. Physical memory typically has more practical limits.
Yes, there are elements with more precision than 64 bit floating point; in particular, 64 bit integers. This effectively uses a larger mantissa by sacrificing all of the exponent. Complex128 is two 64 bit floats, and doesn't have higher precision but a second dimension. There are types that can grow arbitrarily precise, such as Python's long integers (long in python 2, int in python 3) and fractions, but numpy generally doesn't delve into those because they also have matching storage and computation costs. A basic property of the arrays is that they can be addressed using index calculations since the element size is consistent.
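To make the significand point concrete (using nothing beyond numpy and the standard library):
import numpy as np
from fractions import Fraction

# float64 has a 53-bit significand, so it cannot distinguish all 64-bit integers:
print(np.float64(2**63) == np.float64(2**63 + 512))    # True: both round to the same float
print(np.int64(2**62) + np.int64(1))                   # 4611686018427387905, exact

# Arbitrary precision is possible with object arrays, at a large speed and memory cost:
exact = np.array([Fraction(1, 3), Fraction(2, 7)], dtype=object)
print(exact.sum())                                     # 13/21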
Numpy is a library for efficient numerical arrays.
mpmath, when backed by gmpy, is a library for efficient multiprecision numbers.
How do I put them together efficiently? Or is it already efficient to just use a Numpy array with mpmath numbers?
It doesn't make sense to ask for "as efficient as native floats", but you can ask for it to be close to the efficiency of equivalent C code (or, failing that, Java/C# code). In particular, an efficient array of multi-precision numbers would mean that you can do vectorized operations and not have to look up, say, __add__ a million times in the interpreter.
Edit: To the close voter: My question is about an efficient way of putting them together. The answer in the possible duplicate specifically points out that the naive approach is not efficient.
Having a numpy array of dtype=object can be a little misleading, because the powerful numpy machinery that makes operations on the standard dtypes super fast is replaced by the objects' default Python operators, which means that the speed will not be there anymore.
Disclaimer: I maintain gmpy2. The following tests were performed with the development version.
a and b are 1000 element lists containing pseudo-random gmpy2.mpfr values with 250 bits of precision. The test performs element-wise multiplication of the two lists.
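The setup was roughly as follows (the exact way the pseudo-random values were generated here is only an approximation):
import random
import gmpy2

gmpy2.get_context().precision = 250                    # work at 250 bits of precision
a = [gmpy2.mpfr(random.random()) for _ in range(1000)]
b = [gmpy2.mpfr(random.random()) for _ in range(1000)]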
The first test uses a list comprehension:
%timeit [x*y for x,y in zip(a,b)]
1000 loops, best of 3: 322 µs per loop
The second test uses the map function to perform the looping:
%timeit list(map(gmpy2.mul, a, b))
1000 loops, best of 3: 299 µs per loop
The third test is a C implementation of the list comprehension:
%timeit vector2(a,b)
1000 loops, best of 3: 243 µs per loop
In the third attempt, vector2 tries to be a well-behaved Python function. Numeric types are handled using gmpy2's type conversion rules, error-checking is done, the context settings are checked, subnormal numbers are created if requested, exceptions are raised if needed, etc. If you ignore all the Python enhancements and assume all the values are already gmpy2.mpfr, I was able to get the time down in the fourth attempt:
%timeit vector2(a,b)
10000 loops, best of 3: 200 µs per loop
The fourth version doesn't do enough error checking to be of general use but a version between the third and fourth attempts may be possible.
It is possible to decrease Python overhead, but as the precision increases, the effective savings decreases.
As far as I'm aware, there is no existing Python library that supports vectorized array operations on multiple precision values. There's unfortunately no particularly efficient way to use multiple precision values within a numpy ndarray, and it's extremely unlikely that there ever will be, since multiple precision values are incompatible with numpy's basic array model.
Each element in a floating point numpy ndarray takes up the same number of bytes, so the array can be represented in terms of the memory address of the first element, the dimensions, and a regular byte offset (or stride) between consecutive array elements.
This scheme has significant performance benefits - adjacent array elements are located at adjacent memory addresses, so sequential reads/writes to an array benefit from better locality of reference. Striding is also very important for usability, since it allows you to do things like operating on views of the same array without creating new copies in memory. When you do x[::2], you are really just doubling the stride over the first axis of the array, such that you address every other element.
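You can see this bookkeeping directly:
import numpy as np

x = np.arange(10, dtype=np.float64)
print(x.strides)          # (8,)  - 8 bytes between consecutive elements
print(x[::2].strides)     # (16,) - the view just doubles the stride
print(x[::2].base is x)   # True  - no data was copied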
By contrast, an array containing multiple precision values would have to contain elements of unequal size, since higher precision values would take up more bytes than low-precision values. A multiple precision array therefore cannot be regularly strided, and loses out on the benefits mentioned above.
In addition to the problems with constructing arrays, even plain arithmetic on multiple precision scalars is likely to be much slower than for floating point scalars. Almost all modern processors have specialized floating point units, whereas multiple precision arithmetic must be implemented in software rather than hardware.
I suspect that these performance issues might be a big part of the reason why there isn't already a Python library that provides the functionality you're looking for.
A current project is qd, which will be able to embed high-precision numbers in Numpy arrays by taking advantage of the fixed size in memory of its values. Right now the type is available for Numpy, but not yet as a dtype; you can already use it with an object dtype, however.
(If you want to see what the dtype will look like, you may already uncomment the relevant line to compile it with Numpy support; it should work well enough for a glance, but no functions have been implemented yet. The next release should be in September or October.)
There are times when you have to perform many intermediate operations on one, or more, large Numpy arrays. This can quickly result in MemoryErrors. In my research so far, I have found that Pickling (Pickle, CPickle, Pytables etc.) and gc.collect() are ways to mitigate this. I was wondering if there are any other techniques experienced programmers use when dealing with large quantities of data (other than removing redundancies in your strategy/code, of course).
Also, if there's one thing I'm sure of, it's that nothing is free. With some of these techniques, what are the trade-offs (i.e., speed, robustness, etc.)?
I feel your pain... You sometimes end up storing several times the size of your array in values you will later discard. When processing one item in your array at a time, this is irrelevant, but can kill you when vectorizing.
I'll use an example from work for illustration purposes. I recently coded the algorithm described here using numpy. It is a color map algorithm, which takes an RGB image, and converts it into a CMYK image. The process, which is repeated for every pixel, is as follows:
Use the most significant 4 bits of every RGB value, as indices into a three-dimensional look up table. This determines the CMYK values for the 8 vertices of a cube within the LUT.
Use the least significant 4 bits of every RGB value to interpolate within that cube, based on the vertex values from the previous step. The most efficient way of doing this requires computing 16 arrays of uint8s the size of the image being processed. For a 24-bit RGB image, that is equivalent to needing roughly 6 times the storage of the image itself to process it.
A couple of things you can do to handle this:
1. Divide and conquer
Maybe you cannot process a 1,000x1,000 array in a single pass. But if you can do it with a python for loop iterating over 10 arrays of 100x1,000, it is still going to beat a python iterator over 1,000,000 items by a very wide margin! It's going to be slower than a single vectorized pass, yes, but not by nearly as much. (A rough sketch of this idea appears after point 3 below.)
2. Cache expensive computations
This relates directly to my interpolation example above, and is harder to come across, although worth keeping an eye open for. Because I am interpolating on a three-dimensional cube with 4 bits in each dimension, there are only 16x16x16 possible outcomes, which can be stored in 16 arrays of 16x16x16 bytes. So I can precompute them and store them using 64KB of memory, and look up the values one by one for the whole image, rather than redoing the same operations for every pixel at a huge memory cost. This already pays off for images as small as 64x64 pixels, and basically allows processing images with 6 times the number of pixels without having to subdivide the array.
3. Use your dtypes wisely
If your intermediate values can fit in a single uint8, don't use an array of int32s! This can turn into a nightmare of mysterious errors due to silent overflows, but if you are careful, it can provide a big saving of resources.
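A rough sketch of the divide-and-conquer idea from point 1 (expensive_vectorized_op is just a stand-in for whatever per-block computation creates the big temporaries):
import numpy as np

def expensive_vectorized_op(block):
    # Stand-in for a vectorized computation that allocates several temporaries.
    return np.sqrt(block.astype(np.float64)) * 2.0 + 1.0

image = np.random.randint(0, 256, size=(1000, 1000), dtype=np.uint8)
out = np.empty(image.shape, dtype=np.float64)

chunk = 100
for start in range(0, image.shape[0], chunk):
    # Only one block's worth of temporaries is alive at any time.
    out[start:start + chunk] = expensive_vectorized_op(image[start:start + chunk])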
First, most important trick: allocate a few big arrays and use and recycle portions of them, instead of bringing lots of temporary arrays into existence and discarding/garbage collecting them. It sounds a little old-fashioned, but with careful programming the speed-up can be impressive. (You have better control of alignment and data locality, so numeric code can be made more efficient.)
Second: use numpy.memmap and hope that the OS caching of accesses to the disk is efficient enough (a short sketch follows after the third point).
Third: as pointed out by @Jaime, work on block sub-matrices if the whole matrix is too big.
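A minimal numpy.memmap sketch for the second point (the file name and shape are placeholders):
import numpy as np

big = np.memmap("big_array.dat", dtype=np.float64, mode="w+", shape=(100_000, 1_000))
big[:10_000] *= 2.0   # work on one region at a time; only it needs to be resident
big.flush()           # write dirty pages back to the file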
EDIT:
Avoid unnecessary list comprehensions, as pointed out in this answer on SE.
The dask.array library provides a numpy interface that uses blocked algorithms to handle larger-than-memory arrays with multiple cores.
You could also look into Spartan, Distarray, and Biggus.
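A minimal dask.array sketch of the blocked approach:
import dask.array as da

# The array is split into 10,000 x 10,000 chunks; nothing is computed
# (or held in memory all at once) until .compute() is called.
x = da.random.random((100_000, 100_000), chunks=(10_000, 10_000))
column_means = x.mean(axis=0).compute()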
If it is possible for you, use numexpr. For numeric calculations like a**2 + b**2 + 2*a*b (with a and b being arrays), it:
compiles to machine code that executes fast and with minimal memory overhead, taking care of memory locality (and thus cache optimization) when the same array occurs several times in your expression;
uses all cores of your dual or quad core CPU;
and is an extension to numpy, not an alternative.
For medium and large sized arrays, it is faster than numpy alone.
Take a look at the web page linked above; there are examples that will help you understand whether numexpr is for you.
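For instance, a minimal sketch with the expression above (array sizes are arbitrary):
import numpy as np
import numexpr as ne

a = np.random.rand(10_000_000)
b = np.random.rand(10_000_000)

# Plain numpy would allocate temporaries for a**2, b**2 and 2*a*b;
# numexpr evaluates the whole expression in cache-sized chunks instead.
result = ne.evaluate("a**2 + b**2 + 2*a*b")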
On top of everything said in the other answers: we don't always need to keep all the intermediate results of a computation in memory, and when we do, numpy's ufunc machinery offers reduce and accumulate so we don't have to build those intermediates ourselves:
Aggregates
For binary ufuncs, there are some interesting aggregates that can be computed directly from the object. For example, if we'd like to reduce an array with a particular operation, we can use the reduce method of any ufunc. A reduce repeatedly applies a given operation to the elements of an array until only a single result remains.
For example, calling reduce on the add ufunc returns the sum of all elements in the array:
x = np.arange(1, 6)
np.add.reduce(x) # Outputs 15
Similarly, calling reduce on the multiply ufunc results in the product of all array elements:
np.multiply.reduce(x) # Outputs 120
Accumulate
If we'd like to store all the intermediate results of the computation, we can instead use accumulate:
np.add.accumulate(x) # Outputs array([ 1, 3, 6, 10, 15], dtype=int32)
np.multiply.accumulate(x) # Outputs array([ 1, 2, 6, 24, 120], dtype=int32)
Using these numpy operations wisely when performing many intermediate operations on one or more large Numpy arrays can give you great results without any additional libraries.
I'm doing some exercises with simple file encryption/decryption and am currently just reading in a bunch of bytes, performing the appropriate bit operations on each byte one at a time, then writing them to the output file.
This method seems pretty slow. For example, if I want to XOR every byte with 0xFF, I loop over each byte and XOR it with 0xFF, rather than doing some magic so that every byte gets XOR'd at once.
Are there better ways to perform bit operations rather than a byte at a time?
Using the bitwise array operations from numpy may be what you're looking for.
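For example, XOR-ing every byte of a file with 0xFF in a single vectorized call (the file names are placeholders):
import numpy as np

with open("input.bin", "rb") as f:                 # placeholder input file
    data = np.frombuffer(f.read(), dtype=np.uint8)

xored = np.bitwise_xor(data, 0xFF)                 # XOR every byte at once

with open("output.bin", "wb") as f:                # placeholder output file
    f.write(xored.tobytes())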
No matter what, it appears that each byte would have to be
read from memory,
modified in some fashion, and
written back to memory.
You can save a bit (no pun intended) of time by operating on multiple bytes at a time, for example by performing the XOR operation on 4- or even 8-byte integers, hence dividing the overhead associated with managing the loop by roughly a factor of 4 or 8; but this improvement would likely not amount to a significant gain for the overall algorithm. (A short numpy sketch of this idea appears at the end of this answer.)
Additional improvements can be found by replacing the "native" bit operations (XOR, Shifts, Rotations and the like) of the CPU/Language by reading pre-computed values in a table. Beware however that these native operations are typically rather optimized, and that you must be very diligent in designing the equivalent operations externally, and in measuring precisely the relative performance of these operations.
Edit: Oops, I just noted the [Python] tag, and also the reference to numpy in another response.
Beware... while the Numpy bitwise array suggestion is plausible, it all depends on actual parameters of the problem at hand. For example a fair amount of time may be lost in lining up the underlying arrays implied by numpy's bitwise function.
See this Stack Overflow question which seems quite relevant. While focused on the XOR operation, this question provides quite a few actionable hints for both improving loops etc. and for profiling in general.
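In numpy terms, the multi-byte idea from above might look roughly like this (payload stands in for the file's bytes; the leftover tail bytes that don't fill a whole 64-bit word would need separate handling):
import numpy as np

payload = np.random.bytes(1024)                    # placeholder for the file's contents
buf = np.frombuffer(payload, dtype=np.uint8)

# Reinterpret the buffer as 64-bit words and XOR 8 bytes per operation.
words = buf[:len(buf) - len(buf) % 8].view(np.uint64)
xored = (words ^ np.uint64(0xFFFFFFFFFFFFFFFF)).view(np.uint8)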
I tried numpy.zeros((100k x 100k)) and it returned "array is too big".
Response to comments:
1) I could create 10k x 10k matrix but not 100kx100k and 1milx1mil.
2) The matrix is not sparse.
We can do simple maths to find out. A 1 million by 1 million matrix has 1,000,000,000,000 elements. If each element takes up 4 bytes, it would require 4,000,000,000,000 bytes of memory, that is, about 3.64 tebibytes.
There are also chances that a given implementation of Python uses more than that for a single number. For instance, just the leap from a float to a double means you'll need 7.28 tebibytes instead. (There are also chances that Python stores the number on the heap and all you get is a pointer to it, approximately doubling the footprint, without even taking metadata into account; but that's slippery ground, I'm always wrong when I talk about Python internals, so let's not dig into it too much.)
I suppose numpy doesn't have a hardcoded limit, but if your system doesn't have that much free memory, there isn't really anything to do.
Does your matrix have a lot of zero entries? I suspect it does, few people do dense problems that large.
You can easily do that with a sparse matrix. SciPy has a good set built in. http://docs.scipy.org/doc/scipy/reference/sparse.html
The space required by a sparse matrix grows with the number of nonzero elements, not the dimensions.
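A minimal sketch with scipy.sparse:
import numpy as np
from scipy import sparse

# A 1,000,000 x 1,000,000 matrix whose memory use grows with the nonzeros, not the shape.
m = sparse.lil_matrix((1_000_000, 1_000_000), dtype=np.float32)
m[0, 12345] = 3.5
m[99999, 42] = -1.0
csr = m.tocsr()   # CSR is better suited for arithmetic and matrix-vector products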
Your system probably won't have enough memory to store the matrix in RAM, but nowadays you might well have terabytes of free disk space. In that case, numpy.memmap would allow you to have the array stored on disk, but appear as if it resides in memory.
However, it's probably best to rethink the problem. Do you really need a matrix this large? Any computations involving it will probably be infeasibly slow, and need to be done blockwise.