I work with geospatial images in tif format. Thanks to the rasterio lib I can read these images as numpy arrays of dimension (nb_bands, x, y). Here I am working with an image that contains patches of unique values that I would like to count (they were generated with the scipy.ndimage.label function).
My idea was to use numpy's unique function to retrieve the information about these patches as follows:
import rasterio as rio
import numpy as np
import pandas as pd

# identify the clumps
with rio.open(mask) as f:
    mask_raster = f.read(1)
    class_, indices, count = np.unique(mask_raster, return_index=True, return_counts=True)
    del mask_raster

# identify the value
with rio.open(src) as f:
    src_raster = f.read(1)
    src_flat = src_raster.flatten()
    del src_raster

values = [src_flat[index] for index in indices]

df = pd.DataFrame({'patchId': indices, 'nb_pixel': count, 'value': values})
My problem is this:
For an image of shape (69940, 70936) (84.7 MB on my disk), np.unique tries to allocate an array of the same shape in int64 and I get the following error:
Unable to allocate 37.0 GiB for an array with shape (69940, 70936) and data type uint64
Is it normal that unique recasts my array to int64?
Is it possible to force it to use a more optimal format? (Even if all my patches were 1 pixel, np.int32 would be sufficient.)
Is there another solution using a function I don't know?
The uint64 array is probably allocated during argsort here in the source code.
Since the labels from scipy.ndimage.label are consecutive integers starting at zero you can use numpy.bincount:
num_features = np.max(mask_raster)
# np.bincount expects a 1-D array, so flatten the labels first
count = np.bincount(mask_raster.ravel(), minlength=num_features+1)
To get values from src you can do the following assignment. It's really inefficient but I don't think it allocates too much memory.
values = np.zeros(num_features+1, dtype=src_raster.dtype)
values[mask_raster] = src_raster
Maybe scipy.ndimage has a function that better suits the use case.
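For what it's worth, scipy.ndimage does ship per-label reduction helpers that may fit this use case; a hedged sketch, assuming mask_raster (the labels) and src_raster are loaded as in the question:
import numpy as np
from scipy import ndimage

num_features = mask_raster.max()
index = np.arange(1, num_features + 1)   # label ids; 0 is the background from ndimage.label

# pixel count per patch: sum a raster of ones over each label
nb_pixel = ndimage.sum(np.ones_like(mask_raster), labels=mask_raster, index=index)

# one representative src value per patch (minimum here; median/maximum also exist)
value = ndimage.minimum(src_raster, labels=mask_raster, index=index)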
I think splitting the Numpy array into smaller chunks and yielding unique:count values would be a memory-efficient solution, as would changing the data type to int16 or similar.
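A rough sketch of that chunking idea (chunked_value_counts is a hypothetical helper; the chunk size is arbitrary, and the per-chunk counts are merged with collections.Counter):
import numpy as np
from collections import Counter

def chunked_value_counts(raster, chunk_rows=4096):
    """Accumulate value -> count over row blocks instead of sorting the whole raster at once."""
    total = Counter()
    for start in range(0, raster.shape[0], chunk_rows):
        vals, counts = np.unique(raster[start:start + chunk_rows], return_counts=True)
        total.update(dict(zip(vals.tolist(), counts.tolist())))
    return total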
I dug into the scipy.ndimage lib and indeed found a solution that avoids the memory explosion.
Since it slices the initial raster, it is faster than I thought:
from scipy import ndimage
import numpy as np
import pandas as pd
import rasterio as rio

# open the files
with rio.open(mask) as f_mask, rio.open(src) as f_src:
    mask_raster = f_mask.read(1)
    src_raster = f_src.read(1)

# use patches as slicing material
indices = [i for i in range(1, np.max(mask_raster) + 1)]  # one entry per patch label
counts = []
values = []

for i, loc in enumerate(ndimage.find_objects(mask_raster)):
    # look at the src values inside the bounding box of patch i+1
    loc_values, loc_counts = np.unique(src_raster[loc], return_counts=True)
    # the value of the patch is the value with the highest count
    idx = np.argmax(loc_counts)
    counts.append(loc_counts[idx])
    values.append(loc_values[idx])

df = pd.DataFrame({'patchId': indices, 'nb_pixel': counts, 'value': values})
Related
I have some h5 data that I want to sample from using some randomly generated indices. However, if the indices are not in increasing order, the attempt fails. Is it possible to select indices, that have been generated randomly, from h5 data sets?
Here is an MWE reproducing the error:
import h5py
import numpy as np

arr = np.random.random(50).reshape(10,5)
with h5py.File('example1.h5', 'w') as h5fw:
    h5fw.create_dataset('data', data=arr)

random_subset = h5py.File('example1.h5', 'r')['data'][[3, 1]]
# TypeError: Indexing elements must be in increasing order
I could sort the indices, but then we lose the randomness component.
As hpaulj mentioned, random indices aren't a problem for numpy arrays in memory. So, yes it's possible to select data with randomly generated indices from h5 data sets read to numpy arrays. The key is having sufficient memory to hold the dataset in memory. The code below shows how to do this:
#random_subset = h5py.File('example1.h5', 'r')['data'][[3, 1]]
arr = h5py.File('example1.h5', 'r')['data'][:]
random_subset = arr[[3,1]]
A potential solution is to pre-sort the desired indices as follows:
idx = np.sort([3,1])
random_subset = h5py.File('example1.h5', 'r')['data'][idx]
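If the original (unsorted) order matters, one option is to sort the indices for h5py and then undo the sort afterwards; a small sketch, assuming example1.h5 was written as above and wanted holds the randomly generated indices:
import numpy as np
import h5py

wanted = np.array([3, 1])                  # randomly generated indices, any order
order = np.argsort(wanted)                 # h5py wants increasing indices
with h5py.File('example1.h5', 'r') as f:
    picked = f['data'][wanted[order]]      # rows come back in sorted index order
random_subset = picked[np.argsort(order)]  # restore the original random order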
I have a very long array* of length L (let's call it values) that I want to sum over, and a sorted 1D array of the same length L containing integer labels with N distinct values with which to partition the original array – let's call this array labels.
What I'm currently doing is this (module being cupy or numpy):
result = module.empty(N)
for i in range(N):
    result[i] = values[labels == i].sum()
But this can't be the most efficient way of doing it (it should be possible to get rid of the for loop, but how?). Since labels is sorted, I could easily determine the break points and use those indices as start/stop points, but I don't see how this solves the for loop problem.
Note that I would like to avoid creating an array of size NxL along the way, if possible, since L is very large.
I'm working in cupy, but any numpy solution is welcome too and could probably be ported. Within cupy, it seems this would be a case for a ReductionKernel, but I don't quite see how to do it.
* in my case, values is 1D, but I assume the solution wouldn't depend on this
You are describing a groupby sum aggregation. You could write a CuPy RawKernel for this, but it would be much easier to use the existing groupby aggregations implemented in cuDF, the GPU dataframe library. They can interoperate without requiring you to copy the data. If you call .values on the resulting cuDF Series, it will give you a CuPy array.
If you went back to the CPU, you could do the same thing with pandas.
import cupy as cp
import pandas as pd

N = 100
values = cp.random.randint(0, N, 1000)
labels = cp.sort(cp.random.randint(0, N, 1000))

L = len(values)
result = cp.empty(N)  # one slot per label
for i in range(N):
    result[i] = values[labels == i].sum()
result[:5]
array([547., 454., 402., 601., 668.])
import cudf
df = cudf.DataFrame({"values": values, "labels": labels})
df.groupby(["labels"])["values"].sum().values[:5]
array([547, 454, 402, 601, 668])
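For reference, a rough CPU-only sketch of the same groupby sum with plain numpy arrays and pandas (mirroring the names above; labels that never occur are simply absent from the result):
import numpy as np
import pandas as pd

N = 100
values = np.random.randint(0, N, 1000)
labels = np.sort(np.random.randint(0, N, 1000))

df = pd.DataFrame({"values": values, "labels": labels})
result = df.groupby("labels")["values"].sum().to_numpy()
result[:5]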
Here is a solution which, instead of an N x L array, uses an N x <max partition size in labels> array (which should not be large, if the disparity between different partitions is not too high):
Reshape the array into a 2-D array with one partition per row. Since the row length equals the size of the largest partition, fill the unavailable values with zeros (this doesn't affect any sum). This uses #Divakar's solution given here.
def jagged_to_regular(a, parts):
    lens = np.ediff1d(parts, to_begin=parts[0])
    mask = lens[:, None] > np.arange(lens.max())
    out = np.zeros(mask.shape, dtype=a.dtype)
    out[mask] = a
    return out
# `parts` holds the end offset of each partition within the sorted labels
parts = np.cumsum(np.bincount(labels))
parts_stack = jagged_to_regular(values, parts)
Sum along axis 1:
result = np.sum(parts_stack, axis = 1)
In case you'd like a CuPy implementation, there's no direct CuPy alternative to numpy.ediff1d in jagged_to_regular. In that case, you can substitute the statement with numpy.diff like so:
lens = np.insert(np.diff(parts), 0, parts[0])
and then continue to use CuPy as a drop-in replacement for numpy.
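Putting those pieces together, a hedged end-to-end sketch of the CuPy variant (jagged_to_regular_cupy is a hypothetical port of the helper above, and it assumes CuPy's boolean-mask assignment behaves like NumPy's here):
import cupy as cp

def jagged_to_regular_cupy(a, parts):
    # same logic as above, with cp.diff standing in for np.ediff1d
    lens = cp.concatenate([parts[:1], cp.diff(parts)])
    mask = lens[:, None] > cp.arange(int(lens.max()))
    out = cp.zeros(mask.shape, dtype=a.dtype)
    out[mask] = a  # assumes NumPy-like boolean-mask assignment
    return out

labels = cp.sort(cp.random.randint(0, 100, 1000))  # sorted labels, as in the question
values = cp.random.random(1000)
parts = cp.cumsum(cp.bincount(labels))             # end offset of each label block
result = jagged_to_regular_cupy(values, parts).sum(axis=1)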
I have an h5 file with 5 groups, each group containing a 3D dataset. I am looking to build a for loop that allows me to extract each group into a numpy array and assign the numpy array to an object with the group header name. I am able to get a number of different methods to work with one group, but when I try to build a for loop that applies the code to all 5 groups, it breaks. For example:
import h5py as h5
import numpy as np
f = h5.File("FFM0012.h5", "r+") #read in h5 file
print(list(f.keys())) #['FFM', 'Image'] for my dataset
FFM = f['FFM'] #Generate object with all 5 groups
print(list(FFM.keys())) #['Amp', 'Drive', 'Phase', 'Raw', 'Zsnsr'] for my dataset
Amp = FFM['Amp'] #Generate object for 1 group
Amp = np.array(Amp) #Turn into numpy array, this works.
Now when I try to apply the same logic with a for loop:
h5_keys = []
FFM.visit(h5_keys.append)  # Create list of group names ['Amp', 'Drive', 'Phase', 'Raw', 'Zsnsr']
for h5_key in h5_keys:
    tmp = FFM[h5_key]
    h5_key = np.array(tmp)

print(Amp[30,30,30])  # To check that array is populated
When I run this code I get "NameError: name 'Amp' is not defined". I've tried initializing the numpy array before the for loop with:
h5_keys = []
FFM.visit(h5_keys.append)  # Create list of group names
Amp = np.array([])
for h5_key in h5_keys:
    tmp = FFM[h5_key]
    h5_key = np.array(tmp)

print(Amp[30,30,30])  # To check that array is populated
This produces the error message "IndexError: too many indices for array"
I've also tried generating a dictionary and creating numpy arrays from the dictionary. That is a similar story where I can get the code to work for one h5 group, but it falls apart when I build the for loop.
Any suggestions are appreciated!
You seem to have jumped to using h5py and numpy before learning much of Python.
Amp = np.array([])          # creates a numpy array with 0 elements
for h5_key in h5_keys:      # h5_key is set to a new value each iteration
    tmp = FFM[h5_key]
    h5_key = np.array(tmp)  # now you reassign h5_key

print(Amp[30,30,30])        # Amp is still the original (0,) shape array
Try this basic python loop, paying attention to the value of i:
alist = [1,2,3]
for i in alist:
    print(i)
    i = 10
    print(i)

print(alist)  # no change to alist
f is the file.
FFM = f['FFM']
is a group
Amp = FFM['Amp']
is a dataset. There are various ways of loading the dataset into a numpy array. I believe [...] slicing is the currently preferred one; .value used to be used but is now deprecated (loading dataset).
Amp = FFM['Amp'][...]
is an array.
alist = [FFM[key][...] for key in h5_keys]
should create a list of arrays from the FFM group.
If the shapes are compatible, you can concatenate the arrays into one array:
np.array(alist)
np.stack(alist)
np.concatenate(alist, axis=0)  # or other axis
etc
adict = {key: FFM[key][...] for key in h5_keys}
should create a dictionary of arrays keyed by the dataset names.
In Python, lists and dictionaries are the ways of accumulating objects. The h5py groups behave much like dictionaries. Datasets behave much like numpy arrays, though they remain on the disk until loaded with [...].
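As a quick usage sketch of that dictionary (reusing the keys gathered earlier; 'Amp' and 'Phase' are dataset names listed in the question):
adict = {key: FFM[key][...] for key in h5_keys}
print(adict['Amp'][30, 30, 30])  # same check as before, but via the dict
print(adict['Phase'].shape)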
I'm currently doing some memory-intensive text processing, for which I have to construct a sparse matrix of float32s with dimensions of ~ (2M, 5M). I'm constructing this matrix column by column when reading a corpus of 5M documents. For this purpose I use a sparse dok_matrix data structure from SciPy. However, when arriving at the 500 000'th document, my memory is full (approx. 30GB is used) and the program crashes. What I eventually want to do, is perform a dimensionality reduction algorithm on the matrix using sklearn, but, as said, it is impossible to hold and construct the entire matrix in memory. I've looked into numpy.memmap, as sklearn supports this, and tried to memmap some of the underlying numpy data structures of the SciPy sparse matrix, but I could not succeed in doing this.
It is impossible for me to save the entire matrix in a dense format, since this would require 40TB of disk space. So I think that HDF5 and PyTables are no option for me (?).
My question is now: how can I construct a sparse matrix on the fly, but writing directly to disk instead of memory, and such that I can use it afterwards in sklearn?
Thanks!
We've come across similar problems in the field of single cell genomics data dealing with large sparse datasets on disk. I'll show you a small simple example of how I would deal with this. My assumptions are that you're very memory constrained, and probably can't fit multiple copies of the sparse matrix into memory at once. This will work even if you can't fit one entire copy.
I would construct an on disk sparse CSC matrix column by column. A sparse csc matrix uses 3 underlying arrays:
data: the values stored in the matrix
indices: the row index for each value in the matrix
indptr: an array of length n_cols + 1, which divides indices and data by which column they belong to.
As an explanatory example, the values for column i are stored in the range indptr[i]:indptr[i+1] of data. Similarly, the row indices for these values can be found by indices[indptr[i]:indptr[i+1]].
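To make that concrete, a small sketch of pulling column i back out of the three arrays (a random scipy matrix stands in for real data):
import numpy as np
from scipy import sparse

m = sparse.random(5, 4, density=0.4, format="csc", dtype=np.float32)
i = 2
col_data = m.data[m.indptr[i]:m.indptr[i + 1]]     # values stored in column i
col_rows = m.indices[m.indptr[i]:m.indptr[i + 1]]  # their row indices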
To simulate your data generating process (parsing a document, I assume) I'll define a function process_document which returns the values for indices and data for the relevant document.
import numpy as np
import h5py
from scipy import sparse
from tqdm import tqdm  # For monitoring the writing process
from typing import Tuple, Union  # Just for argument annotation


def process_document():
    """
    Simulate processing a document. Results in a sparse vector representation.
    """
    n_items = np.random.negative_binomial(2, .0001)
    indices = np.random.choice(2_000_000, n_items, replace=False)
    indices.sort()
    data = np.random.random(n_items).astype(np.float32)
    return indices, data


def data_generator(n):
    """Iterator which yields simulated data."""
    for i in range(n):
        yield process_document()
Now I'll create a group in an hdf5 file which will store the constituent arrays of a sparse matrix.
def make_sparse_csc_group(f: Union[h5py.File, h5py.Group], groupname: str, shape: Tuple[int, int]):
    """
    Create a group in an hdf5 file that can store a CSC sparse matrix.
    """
    g = f.create_group(groupname)
    g.attrs["shape"] = shape
    g.create_dataset("indices", shape=(1,), dtype=np.int64, chunks=True, maxshape=(None,))
    g["indptr"] = np.zeros(shape[1] + 1, dtype=int)  # We want this to have a zero for the first value
    g.create_dataset("data", shape=(1,), dtype=np.float32, chunks=True, maxshape=(None,))
    return g
And finally a function for reading this group as a sparse matrix (this one is pretty simple).
def read_sparse_csc_group(g: Union[h5py.File, h5py.Group]):
    return sparse.csc_matrix((g["data"], g["indices"], g["indptr"]), shape=g.attrs["shape"])
Now we'll create the on-disk sparse matrix and write one column at a time to it (I'm using fewer columns since this can be kinda slow).
N_COLS = 10

def make_disk_matrix(f, groupname, data_iter, shape):
    group = make_sparse_csc_group(f, groupname, shape)
    indptr = group["indptr"]
    data = group["data"]
    indices = group["indices"]
    n_total = 0
    for doc_num, (cur_indices, cur_data) in enumerate(tqdm(data_iter)):
        n_cur = len(cur_indices)
        n_prev = n_total
        n_total += n_cur
        indices.resize((n_total,))
        data.resize((n_total,))
        indices[n_prev:] = cur_indices
        data[n_prev:] = cur_data
        indptr[doc_num+1] = n_total
# Writing
with h5py.File("data.h5", "w") as f:
    make_disk_matrix(f, "mtx", data_generator(10), (2_000_000, 10))

# Reading
with h5py.File("data.h5", "r") as f:
    mtx = read_sparse_csc_group(f["mtx"])
Again this is considering a very memory constrained situation, where you might not be able to fit the entire sparse matrix in memory when creating it. A much faster way to do this, if you can handle the entire sparse matrix plus at least one copy, would be to not bother with the on disk storage (similar to other suggestions). However, using a slight modification of this code should give you better performance:
def make_memory_mtx(data_iter, shape):
    indices_list = []
    data_list = []
    indptr = np.zeros(shape[1]+1, dtype=int)
    n_total = 0
    for doc_num, (cur_indices, cur_data) in enumerate(data_iter):
        n_cur = len(cur_indices)
        n_prev = n_total
        n_total += n_cur
        indices_list.append(cur_indices)
        data_list.append(cur_data)
        indptr[doc_num+1] = n_total
    indices = np.concatenate(indices_list)
    data = np.concatenate(data_list)
    return sparse.csc_matrix((data, indices, indptr), shape=shape)

mtx = make_memory_mtx(data_generator(10), shape=(2_000_000, 10))
This should be fairly fast, since it only makes a copy of the data once you concatenate the arrays. Other currently posted solutions reallocate the arrays as you process, making many copies of large arrays.
It would be great if you could provide a minimal working example. I can't tell whether your matrix gets too big by construction (1) or just because you have too much data (2). If you don't really care about building this matrix yourself, you can look directly at my remark 2.
For problem (1), in the example code below, I made a wrapper class to build a csr_matrix chunk by chunk. The idea is to accumulate lists of (data, column, row) entries until a buffer limit (see remark 1) is reached, and only then actually update the matrix. Once the limit is reached, this also reduces the data held in memory, since the csr_matrix constructor sums entries that share the same (row, column) coordinates. This part only allows you to construct the sparse matrix in a fast manner (much faster than creating a sparse matrix for each row) and avoids memory errors caused by the redundancy of (row, column) pairs when a word appears several times in a document.
import numpy as np
import scipy.sparse


class SparseMatrixBuilder():
    def __init__(self, shape, build_size_limit):
        self.sparse_matrix = scipy.sparse.csr_matrix(shape)
        self.shape = shape
        self.build_size_limit = build_size_limit
        self.data_temp = []
        self.col_indices_temp = []
        self.row_indices_temp = []

    def add(self, data, col_indices, row_indices):
        self.data_temp.append(data)
        self.col_indices_temp.append(col_indices)
        self.row_indices_temp.append(row_indices)
        if len(self.data_temp) == self.build_size_limit:
            self.sparse_matrix += scipy.sparse.csr_matrix(
                (np.concatenate(self.data_temp),
                 (np.concatenate(self.row_indices_temp),   # row indices come first in the constructor
                  np.concatenate(self.col_indices_temp))),
                shape=self.shape
            )
            self.data_temp = []
            self.col_indices_temp = []
            self.row_indices_temp = []

    def get_matrix(self):
        if self.data_temp:  # guard against an empty buffer right after a flush
            self.sparse_matrix += scipy.sparse.csr_matrix(
                (np.concatenate(self.data_temp),
                 (np.concatenate(self.row_indices_temp),
                  np.concatenate(self.col_indices_temp))),
                shape=self.shape
            )
            self.data_temp = []
            self.col_indices_temp = []
            self.row_indices_temp = []
        return self.sparse_matrix
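A hedged usage sketch of this builder, with documents as columns of a (n_features, n_docs) matrix (all names and sizes here are made up for illustration):
import numpy as np

n_features, n_docs = 2_000_000, 1_000
builder = SparseMatrixBuilder((n_features, n_docs), build_size_limit=256)

for doc_id in range(n_docs):
    n_items = 50  # pretend each document touches 50 features
    rows = np.random.choice(n_features, n_items, replace=False)  # feature ids
    data = np.random.random(n_items).astype(np.float32)
    builder.add(data, np.full(n_items, doc_id), rows)             # col = document, row = feature

mtx = builder.get_matrix()  # scipy.sparse.csr_matrix of shape (n_features, n_docs)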
For problem (2), you can easily extend this class by adding a save method that stores the matrix on disk once the limit (or a second limit) is reached. As such, you'll end up with multiple chunks of sparse matrices on disk. Then you'll need a dimensionality reduction algorithm that can handle chunked matrices (see remark 2).
remark 1: the buffer limit here is not really well defined. It would be better to check the actual size of the numpy arrays data_temp, col_indices_temp and row_indices_temp against the RAM available on the machine (which is quite easy to automate with python).
remark 2: gensim is a python library that has the great advantage of using chunked files for building NLP models. So you could build a dictionary, construct a sparse matrix and reduce its dimension with that library, without needing much RAM.
I'm assuming that all your data can fit in memory using a more memory-friendly sparse matrix format such as COO. If it does not, there is almost no hope you will be able to proceed with sklearn, even by using mmap. Indeed sklearn will likely create subsequent objects with memory requirements of the same order of magnitude as your input.
Scipy's dok_matrix is actually a sub-class of the vanilla dict. It stores the data using individual python objects and tons of pointers, so it is not memory efficient. The most compact representation is the coo_matrix format. You can incrementally build the data required to create a COO matrix by pre-allocating arrays for the coordinates (rows and cols) and the data, and eventually growing these buffers if your initial guess was wrong.
import numpy
from scipy.sparse import coo_matrix


def get_coo_from_iter(iterable, n_data_hint=1<<20, idx_dtype='uint32', data_dtype='float32'):
    counter = 0
    rows = numpy.empty(n_data_hint, dtype=idx_dtype)
    cols = numpy.empty(n_data_hint, dtype=idx_dtype)
    data = numpy.empty(n_data_hint, dtype=data_dtype)
    for row, col, value in iterable:
        if counter >= n_data_hint:
            n_data_hint *= 2
            rows, cols, data = _reallocate(rows, cols, data, n_data_hint)
        rows[counter] = row
        cols[counter] = col
        data[counter] = value
        counter += 1
    rows = rows[:counter]
    cols = cols[:counter]
    data = data[:counter]
    return coo_matrix((data, (rows, cols)))


def _reallocate(rows, cols, data, n):
    new_rows = numpy.empty(n, dtype=rows.dtype)
    new_cols = numpy.empty(n, dtype=cols.dtype)
    new_data = numpy.empty(n, dtype=data.dtype)
    new_rows[:rows.size] = rows
    new_cols[:cols.size] = cols
    new_data[:data.size] = data
    return new_rows, new_cols, new_data
which you can test with randomly-generated data like this:
def get_random_data(n, max_row=2000, max_col=5000):
    for _ in range(n):
        row = numpy.random.choice(max_row)
        col = numpy.random.choice(max_col)
        val = numpy.random.randn()
        yield row, col, val

# test when initial hint is good
coo = get_coo_from_iter(get_random_data(10000), n_data_hint=10000)
print(coo.shape)

# or to test when initial hint was too tiny
coo = get_coo_from_iter(get_random_data(10000), n_data_hint=1111)
print(coo.shape)
Once you have your COO matrix, you may want to convert to CSR using coo.tocsr(). The CSR matrices are more optimized for common operations such as dot product.
It requires a bit more memory in the case where some rows were originally empty, because it stores pointers for all rows, even empty ones.
Look here; at the end he explains how to store and read a sparse matrix directly to an HDF5 file.
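For completeness, that approach essentially boils down to storing the CSR constituent arrays (data, indices, indptr, shape) in the HDF5 file and rebuilding the matrix on load; a minimal sketch, with dataset names of my own choosing:
import h5py
from scipy import sparse

def save_csr(filename, m):
    with h5py.File(filename, "w") as f:
        f.create_dataset("data", data=m.data)
        f.create_dataset("indices", data=m.indices)
        f.create_dataset("indptr", data=m.indptr)
        f.attrs["shape"] = m.shape

def load_csr(filename):
    with h5py.File(filename, "r") as f:
        return sparse.csr_matrix(
            (f["data"][:], f["indices"][:], f["indptr"][:]),
            shape=tuple(f.attrs["shape"]))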
This operation needs to be applied as fast as possible, as the actual arrays contain millions of elements. This is a simple version of the problem.
So, I have a random array of unique integers (normally millions of elements).
totalIDs = [5,4,3,1,2,9,7,6,8 ...]
I have another array (normally tens of thousands of elements) of unique integers from which I can create a mask.
subsampleIDs1 = [5,1,9]
subsampleIDs2 = [3,7,8]
subsampleIDs3 = [2,6,9]
...
I can use numpy to do
mask = np.in1d(totalIDs,subsampleIDs,assume_unique=True)
I can then extract the information I want of another array using the mask (say column 0 contains the one I want).
variable = allvariables[mask][:,0]
Now given that the IDs are unique in both arrays, is there any way to speed this up significantly. It takes a long time to construct the mask for a few thousand points (subsampleIDs) matching against millions of IDs (totalIDs).
I thought of going through it once and writing out a binary file of an index (to speed up future searches).
for i in range(0,3):
    mask = np.in1d(totalIDs,subsampleIDs,assume_unique=True)
    index[mask] = i
where X is in subsampleIDsX. Then I can just do:
for i in range(0,3):
    if index[i] == i:
        rowmatch = i
        break

variable = allvariables[rowmatch:len(subsampleIDs),0]
right? But this is also slow because there is a conditional in the loop to find when it first matches. Is there a faster way to find when a number first appears in an ordered array so the conditional doesn't slow the loop?
I suggest you use a DataFrame in Pandas. The index of the DataFrame is the totalIDs, and you can select the subsampleIDs by: df.loc[subsampleIDs].
Create some test data first:
import numpy as np
N = 2000000
M = 5000
totalIDs = np.random.randint(0, 10000000, N)
totalIDs = np.unique(totalIDs)
np.random.shuffle(totalIDs)
v1 = np.random.rand(len(totalIDs))
v2 = np.random.rand(len(totalIDs))
subsampleIDs = np.random.choice(totalIDs, M)
subsampleIDs = np.unique(subsampleIDs)
np.random.shuffle(subsampleIDs)
Then convert your data into a DataFrame:
import pandas as pd
df = pd.DataFrame(data = {"v1":v1, "v2":v2}, index=totalIDs)
df.loc[subsampleIDs]
DataFrame uses a hashtable to map the index to its location; it's very fast.
Often this kind of indexing is best performed using a DB (with proper column-indexing).
Another idea is to sort totalIDs once, as a preprocessing stage, and implement your own version of in1d, which avoids sorting everything. The numpy implementation of in1d (at least in the version that I have installed) is fairly simple, and should be easy to copy and modify.
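A sketch of that idea, sorting totalIDs once and then using numpy.searchsorted per query (fast_in1d is my own helper name; totalIDs and subsampleIDs1 are the arrays from the question):
import numpy as np

order = np.argsort(totalIDs)  # one-off preprocessing
sorted_ids = totalIDs[order]

def fast_in1d(sorted_ids, order, query):
    """Boolean mask over the original totalIDs marking members of query."""
    query = np.asarray(query)
    pos = np.clip(np.searchsorted(sorted_ids, query), 0, len(sorted_ids) - 1)
    hits = sorted_ids[pos] == query          # which query values are actually present
    mask = np.zeros(len(sorted_ids), dtype=bool)
    mask[order[pos[hits]]] = True            # map sorted positions back to original ones
    return mask

mask = fast_in1d(sorted_ids, order, subsampleIDs1)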
EDIT:
Or, even better, use bucket sort (or radix sort). That should give you O(N+M), N being the size of totalIDs, and M the size of sampleIDs (times a constant you can play with by changing the number of buckets). Here too, you can split totalIDs to buckets only once, which gives you a nifty O(N+M1+M2+...).
Unfortunately, I'm not aware of a numpy implementation, but I did find this: http://en.wikipedia.org/wiki/Radix_sort#Example_in_Python