Large NumPy array handling, numpy data processing, memmap function mapping - python

Large numpy array (over 4GB) with npy file and memmap function
I am using the numpy package for array calculation, and I read https://docs.scipy.org/doc/numpy/neps/npy-format.html
In "Format Specification: Version 2.0" it says that, for a .npy file, the "version 2.0 format extends the header size to 4 GiB".
My questions are:
What is the header size? Does that mean I can only save a numpy.array of at most 4GB into an npy file? How large can a single array go?
I also read https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.memmap.html
where it states that "Memory-mapped files cannot be larger than 2GB on 32-bit systems".
Does that mean numpy.memmap's limitation is based on the memory of the system? Is there any way to avoid that limitation?
Further, I read that we can choose the dtype of the array, where the best resolution is "complex128". Is there any way to "use" and "save" elements with more accuracy on a 64-bit computer (more accurate than complex128 or float64)?

The previous (version 1.0) header size field was 16 bits wide, allowing headers smaller than 64 KiB. Because the header describes the structure of the data and doesn't contain the data itself, this is not a huge concern for most people. Quoting the notes, "This can be exceeded by structured arrays with a large number of columns." So to answer the first question: the header size is under 64 KiB (or under 4 GiB in version 2.0), but the data comes after it, so this is not the array size limit. The format does not specify a data size limit.
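A quick way to see this for yourself (a minimal sketch using the helpers in numpy.lib.format; the file name is arbitrary) is to save an array and check where the header ends:

import numpy as np

# The .npy header only records dtype, shape and memory order; the data payload
# follows it, so the header-size limit says nothing about how big the array
# itself may be.
a = np.zeros((1000, 1000), dtype=np.float64)   # ~8 MB of data
np.save('demo.npy', a)

with open('demo.npy', 'rb') as fp:
    version = np.lib.format.read_magic(fp)                          # e.g. (1, 0)
    shape, fortran_order, dtype = np.lib.format.read_array_header_1_0(fp)
    print(version, shape, dtype, 'header ends at byte', fp.tell())  # well under 64 KiB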
Memory-map capacity depends on the operating system as well as the machine architecture. Nowadays we've largely moved to flat (but typically virtual) address maps, so the program itself, stack, heap, and mapped files all compete for the same space, 4 GiB in total with 32-bit pointers. Operating systems frequently partition this in quite large chunks, so some systems might only allow 2 GiB total for user space, others 3 GiB; and often you can map more memory than you can allocate otherwise. The memmap limitation is more closely tied to the operating system in use than to the physical memory.
Non-flat address spaces, such as using distinct segments on OS/2, could allow larger usage. The cost is that a pointer is no longer a single word. PAE, for instance, gives the operating system a way to use more memory, but still leaves each process with its own 32-bit limit. Typically it's easier nowadays to use a 64-bit system, allowing address spaces up to 16 exbibytes (2^64 bytes). Because data sizes have grown a lot, we also handle them in larger pieces, such as 4 MiB or 16 MiB allocations rather than the classic 4 KiB pages or 512 B sectors. Physical memory typically imposes more practical limits.
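To make the 64-bit case concrete, here is a minimal sketch (file name and shape are placeholders; this creates roughly an 8 GB file) of mapping more data than may fit in RAM:

import numpy as np

# A file-backed array: the mapping reserves address space, and the OS pages
# data in and out on demand as you touch it.
big = np.memmap('big.dat', dtype=np.float64, mode='w+', shape=(1_000_000, 1000))
big[0, :] = 1.0    # touching one slice only faults in those pages
big.flush()        # push dirty pages back to the file on disk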
Yes, there are element types with more precision than 64-bit floating point; in particular, 64-bit integers. This effectively uses a larger mantissa by sacrificing all of the exponent. complex128 is just two 64-bit floats; it doesn't have higher precision, only a second component. There are types that can grow arbitrarily precise, such as Python's long integers (long in Python 2, int in Python 3) and fractions, but numpy generally doesn't delve into those because they come with matching storage and computation costs. A basic property of NumPy arrays is that elements can be addressed with index arithmetic because the element size is consistent.
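If you do want more precision than float64 within NumPy, the usual options are np.longdouble (platform-dependent extended precision) or object arrays holding Python's arbitrary-precision types; a small sketch, with the caveat from above that object arrays give up the fixed element size:

import numpy as np
from fractions import Fraction

# np.longdouble is the platform's C long double (80-bit extended precision on
# typical x86-64 Linux builds, not true quadruple precision).
print(np.finfo(np.longdouble))

# Object arrays can hold Python ints or Fractions, which grow arbitrarily
# precise, but they lose NumPy's fixed-size-element speed and storage benefits.
exact = np.array([Fraction(1, 3), Fraction(2, 7)], dtype=object)
print(exact.sum())   # Fraction(13, 21), computed exactly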

Related

Overhead of loading large numpy arrays

My question is simple, and I could not find a resource that answers it. Somewhat similar links are using asarray, on numbers in general, and the most succinct one here.
How can I "calculate" the overhead of loading a numpy array into RAM (if there is any overhead)? Or, how to determine the least amount of RAM needed to hold all arrays in memory (without time-consuming trial and error)?
In short, I have several numpy arrays of shape (x, 1323000, 1), with x being as high as 6000. This leads to a disk usage of 30GB for the largest file.
All files together need 50GB. Is it therefore enough if I use slightly more than 50GB of RAM (using Kubernetes)? I want to use the RAM as efficiently as possible, so just using 100GB is not an option.
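As a rough sanity check (assuming float32 data, which the question does not state), the in-memory payload of an uncompressed array is just the product of its shape times the item size, and np.load adds only a small constant overhead on top of that:

import numpy as np

# Hypothetical shape/dtype matching the figures in the question.
shape, dtype = (6000, 1323000, 1), np.float32
nbytes = np.prod(shape, dtype=np.int64) * np.dtype(dtype).itemsize
print(nbytes / 2**30, 'GiB')   # ~29.6 GiB, consistent with the ~30 GB file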

NumPy - fastest non-cryptographic collision-resistant hash

I'm looking for the best 64-bit (or at least 32-bit) hash function for NumPy that has the following properties:
It is vectorized for numpy, meaning that it has functions for hashing all elements of any N-D numpy array.
It can be applied to any hashable numpy dtype. For this it is enough for the hash to be able to process a raw block of bytes.
It is very, very fast, like xxhash. In particular it should be fast for lots of small inputs, like a huge array of 32- or 64-bit numbers or short np.str_ values, but it should also handle other dtypes.
It should be collision-resistant. I may use just some subset of the bits, so any subset of the hash's bits should also be collision-resistant.
It may or may not be cryptographic, meaning it is alright if it can sometimes be inverted, like xxhash.
It should produce a 64-bit integer or larger output; 32-bit is still OK, although less preferable. It would be good to be able to choose an output size of 32, 64, or 128 bits.
It should itself convert the numpy array internally to bytes so that hashing is fast; or perhaps numpy already has a conversion function that turns a whole N-D array of any popular dtype into a sequence of bytes, in which case it would be good if someone could point me to it.
I would use the xxhash mentioned in the link above if it were vectorized for numpy arrays. Right now it is single-object only: its binding functions accept one block of bytes per call and produce one integer output. And since xxhash takes only a few CPU cycles per call on small (4- or 8-byte) inputs, doing a pure-Python loop over a large array to call xxhash for every number would be very inefficient.
I need it for different things. One is a probabilistic existence filter (or set), i.e. I need to design a structure (a set) that answers, with a given probability (for a given number N of elements), whether a requested element is probably in the set or not. For that I want to use the lower bits of the hash to spread inputs across K buckets, and each bucket additionally stores some (tweakable) number of higher bits to increase the probability of correct answers. Another application is a Bloom filter. I need this set to be very fast for adding and querying, to be as compact as possible in memory, and to handle a very large number of elements.
If there is no existing good solution, then maybe I can improve the xxhash library and create a pull request to the author's repository.
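For what it's worth, here is a minimal sketch of what a vectorized, non-cryptographic 64-bit hash over a NumPy array can look like. It is not xxhash; it uses a SplitMix64-style finalizer as a stand-in, applied with a handful of whole-array operations:

import numpy as np

def mix64(a):
    # SplitMix64-style mixing applied elementwise to a uint64 array.
    # Not cryptographic; meant for bucketing / filter-style use.
    x = np.asarray(a, dtype=np.uint64) + np.uint64(0x9E3779B97F4A7C15)
    x = (x ^ (x >> np.uint64(30))) * np.uint64(0xBF58476D1CE4E5B9)
    x = (x ^ (x >> np.uint64(27))) * np.uint64(0x94D049BB133111EB)
    return x ^ (x >> np.uint64(31))

# Any fixed-size dtype can be fed in as raw 64-bit words via a view, e.g.:
vals = np.arange(10, dtype=np.int64)
h = mix64(vals.view(np.uint64))
buckets = h & np.uint64(1023)   # low bits pick one of 1024 buckets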

How can I sort 128 bit unsigned integers in Python?

I have a huge number of 128-bit unsigned integers that need to be sorted for analysis (around a trillion of them!).
The research I have done on 128-bit integers has led me down a bit of a blind alley: numpy doesn't seem to fully support them, and the built-in sorting functions are memory-intensive (using lists).
What I'd like to do is load, for example, a billion 128-bit unsigned integers into memory (16GB if just binary data) and sort them. The machine in question has 48GB of RAM, so it should be OK to use 32GB for the operation. If it has to be done in smaller chunks that's OK, but doing as large a chunk as possible would be better. Is there a sorting algorithm in Python that can take such data without requiring a huge overhead?
I can sort 128-bit integers using the .sort method for lists, and it works, but it can't scale to the level that I need. I do have a C++ version that was custom written to do this and works incredibly quickly, but I would like to replicate it in Python to accelerate development time (and I didn't write the C++ and I'm not used to that language).
Apologies if there's more information required to describe the problem, please ask anything.
NumPy doesn't support 128-bit integers, but if you use a structured dtype composed of high and low unsigned 64-bit chunks, those will sort in the same order as the 128-bit integers would:
arr.sort(order=['high', 'low'])
As for how you're going to get an array with that dtype, that depends on how you're loading your data in the first place. I imagine it might involve calling ndarray.view to reinterpret the bytes of another array. For example, if you have an array of dtype uint8 whose bytes should be interpreted as little-endian 128-bit unsigned integers, on a little-endian machine:
arr_structured = arr_uint8.view([('low', 'uint64'), ('high', 'uint64')])
So that might be reasonable for a billion ints, but you say you've got about a trillion of these. That's a lot more than an in-memory sort on a 48GB RAM computer can handle. You haven't asked for something to handle the whole trillion-element dataset at once, so I hope you already have a good solution in mind for merging sorted chunks, or for pre-partitioning the dataset.
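A runnable sketch of that idea, assuming the data arrives as a raw little-endian byte buffer on a little-endian machine (the random bytes here just stand in for real data):

import numpy as np

# 1000 mock 128-bit values as raw little-endian bytes.
rng = np.random.default_rng(0)
raw = rng.integers(0, 256, size=16 * 1000, dtype=np.uint8)

# Reference ordering via Python's arbitrary-precision ints.
expected = sorted(int.from_bytes(row.tobytes(), 'little')
                  for row in raw.reshape(-1, 16))

# View the same bytes as (low, high) 64-bit pairs and sort in place.
pair = np.dtype([('low', '<u8'), ('high', '<u8')])
arr = raw.view(pair)
arr.sort(order=['high', 'low'])   # same order as the uint128 values

smallest = (int(arr[0]['high']) << 64) | int(arr[0]['low'])
assert smallest == expected[0]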
I was probably expecting too much from Python, but I'm not disappointed. A few minutes of coding allowed me to create something (using built-in lists) that can sort a hundred million uint128 items on an 8GB laptop in a couple of minutes.
Given the large number of items to be sorted (1 trillion), it's clear that putting them into smaller bins/files upon creation makes more sense than trying to sort huge numbers in memory. The potential issues created by appending data to thousands of files in 1MB chunks (fragmentation on spinning disks) are less of a worry, because sorting each of those fragmented files produces a sequential file that will be read many times (the fragmented file is written once and read once).
The benefits of development speed of Python seem to outweigh the performance hit versus C/C++, especially since the sorting happens only once.
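A rough sketch of that bin-then-sort idea, reusing the structured dtype from the answer above (the bucket count, chunking and file handling are hypothetical):

import numpy as np

pair = np.dtype([('low', '<u8'), ('high', '<u8')])

def route_chunk(chunk, bucket_files):
    # Append each record to one of 256 bucket files keyed by its top byte.
    # Buckets cover disjoint value ranges, so individually sorted buckets
    # concatenate into a globally sorted sequence.
    top = (chunk['high'] >> np.uint64(56)).astype(np.intp)
    for b in np.unique(top):
        bucket_files[b].write(chunk[top == b].tobytes())

def sort_bucket_in_place(path):
    bucket = np.fromfile(path, dtype=pair)
    bucket.sort(order=['high', 'low'])
    bucket.tofile(path)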

What is the maximum tuple array size in Python 3?

I am building a web scraper that stores data retrieved from four different websites into a tuple array. I later iterate through the tuple and save the entire lot as both CSV and Excel.
Are tuple arrays, or arrays in general, limited by the machine's RAM/disk space?
Thanks
According to the docs, this is given by sys.maxsize:
sys.maxsize
An integer giving the maximum value a variable of type Py_ssize_t can take. It’s usually 2**31 - 1 on a 32-bit platform and 2**63 - 1 on a 64-bit platform.
And interestingly enough, the Python 3 documentation on the data model gives more implementation details under object.__len__:
CPython implementation detail: In CPython, the length is required to be at most sys.maxsize. If the length is larger than sys.maxsize some features (such as len()) may raise OverflowError.
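A quick check on your own interpreter (shown here for a 64-bit CPython build):

import sys

print(sys.maxsize)                # 9223372036854775807 on 64-bit builds
print(sys.maxsize == 2**63 - 1)   # True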
I believe tuples and lists are limited by the size of the machine's virtual memory, unless you're on a 32-bit system, in which case you're limited by the small word size. Also, lists over-allocate by roughly 12% each time they grow too small, so there's a little overhead there as well.
If you're concerned you're going to run out of virtual memory, it might be a good idea to write to a file or files instead.
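A minimal sketch of that write-as-you-go suggestion (the scraper generator and the column names are hypothetical stand-ins):

import csv

def scrape_all_sites():
    # Stand-in for the real scraper: yield one row at a time instead of
    # building a giant list of tuples in memory.
    yield ('example.com', 'widget', '9.99')

with open('scraped.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['site', 'title', 'price'])   # hypothetical header row
    for row in scrape_all_sites():
        writer.writerow(row)                      # nothing accumulates in RAM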

Numpy octuple precision floats and 128 bit ints. Why and how?

This is mostly a question out of curiosity. I noticed that the numpy test suite contains tests for 128 bit integers, and the numerictypes module refers to int128, float256 (octuple precision?), and other types that don't seem to map to numpy dtypes on my machine.
My machine is 64-bit, yet I can use quadruple-precision 128-bit floats (though not really true quad precision). I suppose that if it's possible to emulate quadruple floats in software, one can theoretically also emulate octuple floats and 128-bit ints. On the other hand, until just now I had never heard of either 128-bit ints or octuple-precision floating point. Why is there a reference to 128-bit ints and 256-bit floats in numpy's numerictypes module if there are no corresponding dtypes, and how can I use those?
This is a very interesting question, and there are probably reasons related to Python, to computing in general, and/or to hardware. While not trying to give a full answer, here is what I would point towards...
First note that the types are defined by the language and can differ from your hardware architecture. For example, you could even have doubles on an 8-bit processor. Of course any such arithmetic involves multiple CPU instructions, making the computation much slower. Still, if your application requires it, it might be worth it or even required (better to be late than wrong, especially if, say, you are running a simulation of bridge stability...). So where is 128-bit precision required? Here's the wikipedia article on it...
One more interesting detail is that when we say a computer is, say, 64-bit, this does not fully describe the hardware. There are a lot of pieces that can each be (and at times have been) a different number of bits: the computational registers in the CPU, the memory addressing scheme / memory registers, and the different buses, the most important being the bus from the CPU to memory.
- The ALU (arithmetic and logic unit) has registers that do the calculations. Your machine is 64-bit (I'm not sure whether that also means it could do two 32-bit calculations at the same time). This is clearly the most relevant quantity for this discussion. A long time ago, you could go out and buy a co-processor to speed up calculations of higher precision...
- The registers that hold memory addresses limit the memory the computer can see (directly); that is why computers with 32-bit address registers could only see 2^32 bytes (approx. 4 GB). Notice that for 16 bits this becomes about 65 KB, which is very low. The OS can find ways around this limit, but not for a single program, so no program on a 32-bit computer can normally have more than 4GB of memory.
- Notice that those limits are about bytes, not bits. That is because memory is addressed and loaded in bytes. In fact, loading one byte (8 bits) or 8 bytes (64 bits, the bus width for your computer) takes the same time: you ask for an address and then get all the bits at once through the bus.
It can be that in a given architecture these quantities are not all the same number of bits.
NumPy is amazingly powerful and can handle numbers much bigger than the internal CPU representation (e.g. 64 bit).
In the case of a dynamic type, the number is stored in an array of digits, and the memory block can grow as needed; that is why you can have an integer with 500 digits. This dynamic type is called a bignum. In older Python versions it was the type long; in newer Python (3.0+) there is only one integer type, int, which behaves like the old long and supports an almost arbitrary number of digits (i.e. a bignum).
If you specify a data type (int32 for example), then you specify the bit length and bit format, i.e. which bits in memory stand for what. Example:
dt = np.dtype(np.int32) # 32-bit integer
dt = np.dtype(np.complex128) # 128-bit complex floating-point number
Look in: https://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html
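As a quick check of what your own build actually exposes (names like int128 or float256 only appear when the platform provides the underlying C type, which typical x86-64 builds do not):

import numpy as np

# Which extended types does this build expose?
for name in ('int128', 'float128', 'float256', 'longdouble'):
    print(name, hasattr(np, name))

# Even where float128 exists (e.g. x86-64 Linux), it is usually the C long
# double: 80-bit extended precision padded to 16 bytes, not true quad precision.
print(np.finfo(np.longdouble))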
