Obtain lengths of vectors without loading multiple .npy files - python

I have around 2000 .npy files, each representing a 1-dimensional vector of floats with between 100,000 and 1,000,000 entries (both of these numbers will substantially grow in the future). For each file, I would like the length of the vector it contains. The following option would be possible but time consuming:
lengths = [numpy.shape(numpy.load(whatever))[0] for whatever in os.listdir(some_dir)]
Question:
What is the most efficient/fastest way to derive this list of vector lengths? Surely I should be able to work directly from the file sizes - but what is the best way to do this?

Using memmapped files will speed this up considerably.
By memmapping the file, numpy only loads the header to get the array shape and dtype, while the actual array data is left on disk until needed.
import os
import numpy as np
# Open the files as memmaps; only the header is read at this point
data = [np.load(os.path.join(some_dir, f), mmap_mode='r') for f in os.listdir(some_dir)]
# Checking your assumptions never hurts
assert all(d.ndim == 1 for d in data)
lengths = [d.shape[0] for d in data]
Edit: The reason you need to read the file headers rather than using the file size directly is that the header of an .npy file is not necessarily a fixed length. Although for a one-dimensional array without fields or field names it probably won't change (see https://www.numpy.org/devdocs/reference/generated/numpy.lib.format.html).
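If you want to avoid creating memmap objects altogether, here is a minimal sketch of reading just the header with the public numpy.lib.format helpers (assuming ordinary .npy files written by np.save with version 1.0 headers):
import os
import numpy as np

def npy_shape(path):
    # Read only the .npy header; the array data itself is never touched
    with open(path, 'rb') as f:
        np.lib.format.read_magic(f)
        shape, fortran_order, dtype = np.lib.format.read_array_header_1_0(f)
    return shape

lengths = [npy_shape(os.path.join(some_dir, f))[0] for f in os.listdir(some_dir)]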

You can probably work from the file size directly:
import os
# array length
a = os.stat('1darray.npy')
length = int((a.st_size - 128) / itemsize)
Here 128 bytes is the header overhead a simple .npy file adds when saved to disk. The actual size in bytes of any numpy array can be found as array.nbytes, so a.st_size - 128 == array.nbytes, and array.nbytes / array.itemsize == array.size, which is the array length.
Here itemsize = 2 if the array is float16, 4 if it is float32, and 8 if it is float64.
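For reference, the itemsize can be read off the dtype instead of being hard-coded:
import numpy as np

np.dtype(np.float16).itemsize  # 2
np.dtype(np.float32).itemsize  # 4
np.dtype(np.float64).itemsize  # 8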
Here is a demo
import numpy as np
import os
array = np.arange(12, dtype=np.float64)
print(array.itemsize) # >> gives 8 for float64
np.save('1darray.npy', array)
a = os.stat('1darray.npy')
length = int((a.st_size - 128)/8) # >> gives 12 which is equal to array.size
So you do have to know the dtype of the saved .npy files. For your case you might then do:
lengths = [(os.stat(os.path.join(some_dir, whatever)).st_size - 128) // 8 for whatever in os.listdir(some_dir)]
assuming the dtype of the saved arrays is float64.

Related

Deterministic method to hash np.array -> int

I am creating a system that stores large numpy arrays in pyarrow.plasma.
I want to give each array a unique, deterministic plasma.ObjectID; np.array is not hashable, sadly.
My current (broken) approach is:
import numpy as np
from pyarrow import plasma

def int_to_bytes(x: int) -> bytes:
    # https://stackoverflow.com/questions/21017698/converting-int-to-bytes-in-python-3
    return x.to_bytes((x.bit_length() + 7) // 8, "big")

def get_object_id(arr):
    arr_id = int(arr.sum() / arr.shape[0])
    oid: bytes = int_to_bytes(arr_id).zfill(20)  # fill from left with zeroes, must be of length 20
    return plasma.ObjectID(oid)
But this can easily fail, for example:
arr = np.arange(12)
a1 = arr.reshape(3, 4)
a2 = arr.reshape(3,2,2)
assert get_object_id(a1) != get_object_id(a2), 'Hash collision'
# another good test case
assert get_object_id(np.ones(12)) != get_object_id(np.ones(12).reshape(4,3))
assert get_object_id(np.ones(12)) != get_object_id(np.zeros(12))
It also involves summing the array, which could be very slow for large arrays.
Feel free to assume that the dtype of arr will be np.uint or np.int.
I now think that it's impossible to never have a hash collision (I only have 20 bytes of ID and there are more than 2^20 possible inputs), so I am just looking for something that is either
a) cheaper to compute
b) less likely to fail in practice
or, ideally, both!
The hashlib module has routines for computing hashes from byte strings (typically used for checksums and cryptographic digests). You can convert an ndarray into a byte string with ndarray.tobytes; however, your examples will still fail because those arrays have the same bytes but different shapes. So you could just hash the shape as well.
import hashlib

def hasharr(arr):
    hash = hashlib.blake2b(arr.tobytes(), digest_size=20)
    for dim in arr.shape:
        hash.update(dim.to_bytes(4, byteorder='big'))
    return hash.digest()
Example:
>>> hasharr(a1)
b'\x9f\xd7<\x16\xb6u\xfdM\x14\xc2\xe49.\xf0P\xaa[\xe9\x0bZ'
>>> hasharr(a2)
b"Z\x18+'`\x83\xd6\xc8\x04\xd4%\xdc\x16V)\xb3\x97\x95\xf7v"
I'm not an expert on blake2b so you'd have to do your own research to figure out how likely a collision would be.
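Since digest_size=20 yields exactly the 20 bytes the question says plasma.ObjectID requires, the digest can be passed straight through. A minimal sketch, assuming the same plasma import as in the question:
from pyarrow import plasma

def get_object_id(arr):
    # hasharr() returns a 20-byte digest, which is the length plasma.ObjectID expects
    return plasma.ObjectID(hasharr(arr))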
I'm not sure why you tagged pyarrow, but if you want to do the same on pyarrow arrays without converting to numpy, you can get the buffers of an array with arr.buffers() and convert these buffers (there will be multiple, and some may be None) to byte strings with buf.to_pybytes(). Just hash all the buffers. There is no need to worry about the shape here because pyarrow arrays are always one-dimensional.
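A minimal sketch of that buffer-based approach (assuming arr is a pyarrow.Array):
import hashlib
import pyarrow as pa

def hash_arrow_array(arr: pa.Array) -> bytes:
    # Hash every buffer backing the array; some buffers may be None
    h = hashlib.blake2b(digest_size=20)
    for buf in arr.buffers():
        if buf is not None:
            h.update(buf.to_pybytes())
    return h.digest()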

File format optimized for sparse matrix exchange

I want to save a sparse matrix of numbers (integers, but it could be floats) to a file for data exchange. By sparse matrix I mean a matrix where a high percentage of the values (typically 90%) are equal to 0. Sparse here refers to the actual content of the matrix, not to the file format.
The matrix is formatted in the following way:
col1 col2 ....
row1 int1_1 int1_2 ....
row2 int2_1 .... ....
.... .... .... ....
By using a text file (tab-delimited) the size of the file is 4.2G. Which file format, preferably ubiquitous such as a .txt file, can I use to easily load and save this sparse data matrix? We usually work with Python/R/Matlab, so formats that are supported by these are preferred.
I found the Feather format (which currently does not support Matlab, afaik).
Some comparison of read/write speed and memory performance in Pandas is provided in this section.
It also provides support for the Julia language.
Edit:
I found that this format in my case uses more disk space than the .txt one, probably to increase I/O performance. Compressing with zip alleviates the problem, but compression during writing does not seem to be supported yet.
You have several solutions, but generally what you need to do is output the indices of the non-zero elements as well as the values. Let's assume that you want to export to a single text file.
Generate array
Let's first generate a 10000 x 5000 sparse array with ~10% filled (it will be a bit less due to replicated indices):
N = 10000;
M = 5000;
rho = .1;
rN = ceil(sqrt(rho)*N);
rM = ceil(sqrt(rho)*M);
S = sparse(N, M);
S(randi(N, [rN 1]), randi(M, [rM 1])) = randi(255, rN, rM);
If your array is not stored as a sparse array, you can create it simply using (where M is the full array):
S = sparse(M);
Save as text file
Now we will save the matrix in the following format
row_indx col_indx value
row_indx col_indx value
row_indx col_indx value
This is done by extracting the row and column indices as well as data values and then saving it to a text file in a loop:
[n, m, s] = find(S);
fid = fopen('Sparse.txt', 'wt');
arrayfun(@(n, m, s) fprintf(fid, '%d\t%d\t%d\n', n, m, s), n, m, s);
fclose(fid);
If the underlying data is not an integer, then you can use the %f flag on the last output, e.g. (saved with 15 decimal places)
arrayfun(@(n, m, s) fprintf(fid, '%d\t%d\t%.15f\n', n, m, s), n, m, s);
Compare this to the full array:
fid = fopen('Full.txt', 'wt');
arrayfun(@(n) fprintf(fid, '%s\n', num2str(S(n, :))), (1:N).');
fclose(fid);
In this case, the sparse file is ~50MB and the full file ~170MB representing a factor of 3 efficiency. This is expected since I need to save 3 numbers for every nonzero element of the array, and ~10% of the array is filled, requiring ~30% as many numbers to be saved compared to the full array.
For floating point format, the saving is larger since the size of the indices compared to the floating point value is much smaller.
In Matlab, a quick way to extract the data would be to save the string given by:
mat2str(S)
This is essentially the same but wraps it in the sparse command for easy loading in Matlab - one would need to parse this in other languages to be able to read it in. The command tells you how to recreate the array, implying you may need to store the size of the matrix in the file as well (I recommend doing it in the first line, since you can read this in and create the sparse matrix before parsing the rest of the file).
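For reading that triplet text file from Python, a minimal sketch with NumPy and SciPy (assuming the tab-separated row/col/value layout above and 1-based Matlab indices):
import numpy as np
from scipy import sparse

# Each line of Sparse.txt is: row_indx  col_indx  value
rows, cols, vals = np.loadtxt('Sparse.txt', unpack=True)
S = sparse.coo_matrix((vals, (rows.astype(int) - 1, cols.astype(int) - 1)))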
Save as binary file
A much more efficient method is to save as a binary file. Assuming the data and indices can be stored as unsigned 16 bit integers you can do the following:
[n, m, s] = find(S);
fid = fopen('Sparse.dat', 'w');
fwrite(fid, size(S), 'uint16');
fwrite(fid, [n m s], 'uint16');
fclose(fid);
Then to read the data:
fid = fopen('Sparse.dat', 'r');
sz = fread(fid, 2, 'uint16');
s = reshape(fread(fid, 'uint16'), [], 3);
s = sparse(s(:, 1), s(:, 2), s(:, 3), sz(1), sz(2));
fclose(fid);
Now we can check they are equal:
isequal(S, s)
Saving the full array:
fid = fopen('Full.dat', 'w');
fwrite(fid, full(S), 'uint16');
fclose(fid);
Comparing the sparse and full file sizes I get 21MB and 95MB.
A couple of notes:
Using a single write/read command is much (much much) quicker than looping, so the last method is by far the fastest, and also most space efficient.
The maximum index/data value size that can be saved as a binary integer is 2^n - 1, where n is the bitdepth. In my example of 16 bits (uint16), that corresponds to a range of 0..65,535. By the sounds of it, you may need to use 32 bits or even 64 bits just to store the indices.
Higher efficiency can be obtained by saving the indices as one data type (e.g. uint32) and the actual values as another (e.g. uint8). However, this adds additional complexity in the saving and reading.
You will still want to store the matrix size first, as I showed in the binary example.
You can store the values as doubles if required, but indices should always be integers. Again, extra complexity, but doable.
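Since the question mentions Python as well, here is a minimal sketch of reading the binary layout above with NumPy/SciPy (assuming uint16 for the size, indices and values, as in the Matlab example):
import numpy as np
from scipy import sparse

with open('Sparse.dat', 'rb') as f:
    n_rows, n_cols = np.fromfile(f, dtype=np.uint16, count=2)
    # fwrite stored the n, m and s columns back to back (column-major order)
    rows, cols, vals = np.fromfile(f, dtype=np.uint16).reshape(3, -1)

# Matlab indices are 1-based, so shift them down by one
S = sparse.coo_matrix((vals, (rows - 1, cols - 1)), shape=(int(n_rows), int(n_cols)))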

Why is the max element value in a numpy array 255?

I am currently using numpy to create an array. I would like to use vectorized implementations to more efficiently take the average of the elements in a position, (i, j). These arrays are coming from images in a file directory which have all been standardized to a fixed size.
However, when I try to add the image arrays, the sum of each element is returned modulo 256. How can I change the maximum value of the elements?
Your arrays are presumably of type numpy.uint8, so they wrap around when they hit 256.
If you want to get larger results, use astype to convert the first argument to a larger data type, e.g.:
a = np.array(..., dtype=np.uint8)
b = np.array(..., dtype=np.uint8)
c = a.astype(np.uint32) + b
and you'll get a result array of the larger data type too.
Per @Eric, to avoid the temporary you can use the numpy add function (not method) to do the addition, passing a dtype so the result is of the new type even though the inputs are not converted, avoiding a temporary (at least at the Python level):
c = np.add(a, b, dtype=np.uint32)
You would be better off creating the output array first:
average = numpy.zeros(a.shape, numpy.float32)
image = numpy.zeros_like(average)
Then traversing the images and adding them up in-place:
for i in images:
    image[:] = function_that_reads_images_as_uint8(i)
    average += image
average /= len(images)
You might get away with int types if you didn't need the precision in the division step.

How to do elementwise processing (first int, then pairwise absolute length) to avoid memory problems? (python)

I want to process a big list of uint numbers (test1), and I do that in chunks of "length". I need them as signed int, and then I need the absolute length of each pair of even- and odd-indexed values in this list.
But I want to get rid of two problems:
it uses a lot of RAM
it takes ages!
So how could I make this faster? Any trick? I could also use numpy, no problem in doing so.
Thanks in advance!
test2 = -127 + test1[i:i+length*2048000*2 + 2048000*2*1]
test3 = (test2[::2]**2 + test2[1::2]**2)**0.5
An efficient way is to use Numpy functions, e.g.:
n = 10
ff = np.random.randint(0, 255, n) # generate some data
ff2 = ff.reshape(n // 2, 2) # new view on ff (only makes a copy if needed)
l_ff = np.linalg.norm(ff2, axis=1) # calculate vector length of each row
Note that when modifying an entry in ff2 then ff will change as well and vice versa.
Internally, Numpy stores data as contiguous memory blocks. So there are further methods besides np.reshape() to exploit that structure. For efficient conversion of data types, you can try:
dd_s = np.arange(-5, 10, dtype=np.int8)
dd_u = dd_s.astype(np.uint8) # conversion from signed to unsigned
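A hypothetical sketch tying those pieces to the chunked use case from the question (the names and chunk size are illustrative, not the real data):
import numpy as np

test1 = np.random.randint(0, 255, 2_000_000).astype(np.uint8)  # stand-in for the real data
chunk = test1[:1_000_000 * 2].astype(np.int16) - 127           # signed values, no uint8 wraparound
pairs = chunk.reshape(-1, 2)                                    # (even, odd) sample pairs
lengths = np.linalg.norm(pairs, axis=1)                         # absolute length per pair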

Python Array size is doubled after saving in binary format

I have data stored in an array of size (4320, 2160), reshaped from a list of length 4320*2160. When I save the file in binary format using numpy's tofile method and then open the file, I notice that the array is double in length. How do I get the original values of the array? I'm assuming it has something to do with endianness, but I'm unfamiliar with dealing with it.
cdom=np.reshape(cdom, (4320,2160), order='F') # array of float values
cdom.size # 4320*2160
cdom.tofile(filename)
arr = np.fromfile(filename, dtype=np.float32)
arr.size # double the size of cdom: 2*4320*2160
It looks like cdom has type np.float64, and you are reading the binary file as np.float32, so the length is doubled (and the values are effectively garbage).
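A minimal sketch of keeping writer and reader consistent (assuming float32 is the precision you actually want; reading back with dtype=np.float64 instead would work just as well):
import numpy as np

# Write with an explicit dtype so the reader can match it
cdom.astype(np.float32).tofile(filename)

arr = np.fromfile(filename, dtype=np.float32).reshape(4320, 2160)
arr.size  # 4320*2160, as expected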
