Efficient Histogram of Differences for sparse Data


I want to compute a histogram of the differences between all the elements in one array A with all the elements in another array B.
So I want to have a histogram of the following data:
Delta1 = A1-B1
Delta2 = A1-B2
Delta3 = A1-B3
...
DeltaN = A2-B1
DeltaN+1 = A2-B2
DeltaN+2 = A2-B3
...
The point of this calculation is to show that the two data sets are correlated, even though not every data point has a "partner" in the other array and the correlation is rather noisy in practice.
The problem is that the files are very large in practice (several GB), and all entries of the vectors are 64-bit integers with very large differences between them.
It seems infeasible to me to convert these data to binary arrays in order to compute this with correlation functions and Fourier transforms.
Here is a small example to give a better taste of what I'm looking at.
This implementation, which calls numpy's searchsorted inside a for loop, is rather slow.
import numpy as np
import matplotlib.pyplot as plt
timetagsA = [668656283,974986989,1294941174,1364697327,\
1478796061,1525549542,1715828978,2080480431,2175456303,2921498771,3671218524,\
4186901001,4444689281,5087334517,5467644990,5836391057,6249837363,6368090967,8344821453,\
8933832044,9731229532]
timetagsB = [13455,1294941188,1715828990,2921498781,5087334530,5087334733,6368090978,9731229545,9731229800,9731249954]
max_delta_t = 500
nbins = 10000
histo=np.zeros((nbins,2), dtype = float)
histo[:,0]=np.arange(0,nbins)
for i in range(len(timetagsA)):
    delta_t = 0
    j = np.searchsorted(timetagsB,timetagsA[i])
    while (np.round(delta_t) < max_delta_t and j<len(timetagsB)):
        delta_t = timetagsB[j] - timetagsA[i]
        if(delta_t<max_delta_t):
            histo[int(delta_t),1]+=1
        j = j+1
plt.plot(histo[0:50,1])
plt.show()
It would be great if someone could help me to find a faster way to compute this. Thanks in advance!

EDIT
The solution below assumes that your data is too large to use np.subtract.outer and then select the values you want to keep:
arr_diff = np.subtract.outer(arrB, arrA)
print (arr_diff[(0<arr_diff ) &(arr_diff <max_delta_t)])
# array([ 14, 12, 10, 13, 216, 11, 13, 268], dtype=int64)
This works with your example data, but not with a data set too large to hold the full outer difference in memory.
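If the full outer difference does not fit in memory, one possible middle ground is to apply the same filter block by block. This is a minimal sketch and not part of the original answer: the helper name chunked_delta_counts and its chunk parameter are hypothetical, and both arrays are assumed to be sorted in ascending order.

```python
import numpy as np

def chunked_delta_counts(arr_a, arr_b, max_delta_t, chunk=1024):
    """Count differences 0 < b - a < max_delta_t, one block of arr_a at a time.

    Hypothetical helper; assumes arr_a and arr_b are sorted ascending.
    """
    arr_a = np.asarray(arr_a, dtype=np.int64)
    arr_b = np.asarray(arr_b, dtype=np.int64)
    counts = np.zeros(max_delta_t, dtype=np.int64)
    for start in range(0, arr_a.size, chunk):
        block = arr_a[start:start + chunk]
        # only the part of arr_b that can pair with this block is needed
        lo = np.searchsorted(arr_b, block[0])
        hi = np.searchsorted(arr_b, block[-1] + max_delta_t)
        diff = np.subtract.outer(arr_b[lo:hi], block)
        keep = diff[(0 < diff) & (diff < max_delta_t)]
        counts += np.bincount(keep, minlength=max_delta_t)
    return counts

timetagsA = [668656283, 974986989, 1294941174, 1364697327, 1478796061,
             1525549542, 1715828978, 2080480431, 2175456303, 2921498771,
             3671218524, 4186901001, 4444689281, 5087334517, 5467644990,
             5836391057, 6249837363, 6368090967, 8344821453, 8933832044,
             9731229532]
timetagsB = [13455, 1294941188, 1715828990, 2921498781, 5087334530,
             5087334733, 6368090978, 9731229545, 9731229800, 9731249954]

# a tiny chunk size is used here only to exercise the block logic
counts = chunked_delta_counts(timetagsA, timetagsB, 500, chunk=4)
print(np.flatnonzero(counts))  # bins that received at least one count
```

The memory needed per block is roughly chunk times the number of arrB values that fall inside that block's time window, so chunk can be tuned to the available RAM.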
ORIGINAL SOLUTION
Let's first suppose that max_delta_t is smaller than the difference between any two successive values in timetagsB, which makes for an easy solution (we can then try to generalize it).
#create the array instead of list
arrA = np.array(timetagsA)
arrB = np.array(timetagsB)
max_delta_t = np.diff(arrB).min() - 1 #here it's 202 just for the explanation
You can use np.searchsorted in a vectorized way:
# create the array of search
arr_search = np.searchsorted(arrB, arrA) # the position of each element of arrA in arrB
print (arr_search)
# array([1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 6, 6, 6, 6, 7, 7, 7],dtype=int64)
You can calculate the difference between the element of arrB corresponding to each element of arrA by indexing arrB with arr_search:
# calculate the difference
arr_diff = arrB[arr_search] - arrA
print (arr_diff[arr_diff<max_delta_t]) # find the ones smaller than max_delta_t
# array([14, 12, 10, 13, 11, 13], dtype=int64)
So what you are looking for is then calculated by np.bincount
arr_bins = np.bincount(arr_diff[arr_diff<max_delta_t])
#to make it look like histo but not especially necessary
histo = np.array([range(len(arr_bins)),arr_bins]).T
Now the problem is that some differences between arrA and arrB cannot be obtained with this method when max_delta_t is larger than the gap between two successive values in arrB. Here is one way to handle any value of max_delta_t; it may not be the most efficient, depending on the values in your data.
#need an array with the number of elements in arrB for each element of arrA
# within a max_delta_t range
arr_diff_search = np.searchsorted(arrB, arrA + max_delta_t)- np.searchsorted(arrB, arrA)
#do a loop to calculate all the values you are interested in
list_arr = []
for i in range(arr_diff_search.max()+1):
    arr_diff = arrB[(arr_search+i)%len(arrB)][(arr_diff_search>=i)] - arrA[(arr_diff_search>=i)]
    list_arr.append(arr_diff[(0<arr_diff)&(arr_diff<max_delta_t)])
Now you can np.concatenate list_arr and use np.bincount, like so:
arr_bins = np.bincount(np.concatenate(list_arr))
histo = np.array([range(len(arr_bins)),arr_bins]).T
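The Python-level loop over i can also be avoided by expanding every (A, B) pair inside the window with np.repeat. The following is only a sketch under stated assumptions, not the answer's method: the function name delta_histogram is hypothetical, both arrays are assumed sorted, and (like the loop in the question, unlike the 0 < arr_diff filter above) it also counts differences equal to zero.

```python
import numpy as np

def delta_histogram(arr_a, arr_b, max_delta_t):
    """Histogram of b - a over all pairs with 0 <= b - a < max_delta_t.

    Hypothetical helper; assumes arr_a and arr_b are sorted ascending.
    """
    arr_a = np.asarray(arr_a, dtype=np.int64)
    arr_b = np.asarray(arr_b, dtype=np.int64)
    first = np.searchsorted(arr_b, arr_a)                # first b >= a
    last = np.searchsorted(arr_b, arr_a + max_delta_t)   # first b >= a + max_delta_t
    n_pairs = last - first                               # partners per element of arr_a
    # position of every pair inside its own window: 0, 1, ..., n_pairs[k] - 1
    window_start = np.cumsum(n_pairs) - n_pairs
    offset = np.arange(n_pairs.sum()) - np.repeat(window_start, n_pairs)
    deltas = arr_b[np.repeat(first, n_pairs) + offset] - np.repeat(arr_a, n_pairs)
    return np.bincount(deltas, minlength=max_delta_t)

timetagsA = [668656283, 974986989, 1294941174, 1364697327, 1478796061,
             1525549542, 1715828978, 2080480431, 2175456303, 2921498771,
             3671218524, 4186901001, 4444689281, 5087334517, 5467644990,
             5836391057, 6249837363, 6368090967, 8344821453, 8933832044,
             9731229532]
timetagsB = [13455, 1294941188, 1715828990, 2921498781, 5087334530,
             5087334733, 6368090978, 9731229545, 9731229800, 9731249954]

hist = delta_histogram(timetagsA, timetagsB, 500)
print(np.flatnonzero(hist))  # bins that received at least one count
```

The temporary arrays grow with the number of pairs inside the window rather than with len(arrA) * len(arrB), which is what makes this usable for sparse data.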

Related

Summing a numpy array based on a multi-labeled mask

Say I have an array:
x = np.array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
And a multi-labeled mask:
labels = np.array([[0, 0, 2],
[1, 1, 2],
[1, 1, 2]])
My goal is to sum the entries of x together, grouped by labels. For example:
n_labels = np.max(labels) + 1
out = np.empty(n_labels)
for label in range(n_labels):
    mask = labels == label
    out[label] = np.sum(x[mask])
>>> out
np.array([1, 20, 15])
However, as x and n_labels become large, I see this being inefficient. Each iteration, you are only summing together a small fraction of the number of entries of x, but still recompute the mask over all of labels (in the expression labels == label) and subsequently index over all of x (in the expression x[mask]). Is there a more efficient way to do this as x and n_labels grow large?

You can use bincount with weights:
np.bincount(labels.ravel(), weights=x.ravel())
out:
array([ 1., 20., 15.])

You don't really have a reason to operate on 2D arrays, so ravel them first:
labels = labels.ravel()
x = x.ravel()
If your labels are already indices, you can use np.argsort along with np.diff and np.add.reduceat:
index = labels.argsort()
splits = np.r_[0, np.flatnonzero(np.diff(labels[index])) + 1]
result = np.add.reduceat(x[index], splits)
labels[index] is the sorted array of labels. Whenever its value changes, you enter a new group, and the diff is nonzero. That's what np.flatnonzero(np.diff(labels[index])) finds for you. Since each nonzero diff sits at the last element of a run and reduceat needs the index where the next run starts, you need to add one. np.r_ lets you easily prepend zero to a 1D array, which is necessary for reduceat to include the first run (the last is always automatic).
Before you run reduceat, you need to order x into the runs defined by labels, which is what x[index] does.
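To make the intermediate values concrete, here is the same recipe run on the x and labels from the question. It is a small illustrative sketch; the name sorted_labels and the kind="stable" argument are additions for readability and reproducibility, not part of the answer above.

```python
import numpy as np

x = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]]).ravel()
labels = np.array([[0, 0, 2], [1, 1, 2], [1, 1, 2]]).ravel()

index = labels.argsort(kind="stable")
sorted_labels = labels[index]                                   # [0 0 1 1 1 1 2 2 2]
splits = np.r_[0, np.flatnonzero(np.diff(sorted_labels)) + 1]   # [0 2 6]
result = np.add.reduceat(x[index], splits)                      # [ 1 20 15]
print(splits, result)
```

Note that result has one entry per label that actually occurs, in the order of np.unique(labels); a label value that never appears gets no slot, unlike with np.bincount.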

You can use 2D arrays with another slow and over-engineered approach using np.add.at
sums = np.zeros(labels.max()+1, x.dtype)
np.add.at(sums, labels, x)
sums
Output
array([ 1, 20, 15])

Numpy array normalization by group ids:

Suppose data and labels are NumPy arrays as below:
import numpy as np
data=np.array([[0,4,5,6,8],[0,6,8,9],[1,9,5],[1,45,7],[1,8,3]], dtype=object) #Note: the rows have different lengths, so recent NumPy versions need dtype=object
labels=np.array([4,6,10,4,6])
The first element in each row in data shows an id of a group. I want to normalize (see below example) the labels based on the group ids:
For example the first two rows in data have id=0; thus, their label must be:
normalized_labels[0]=labels[0]/(4+6)=0.4
normalized_labels[1]=labels[1]/(4+6)=0.6
The expected output should be:
normalized_labels=[0.4,0.6,0.5,0.2,0.3]
I have a naive solution as:
ids=np.array([data[i][0] for i in range(data.shape[0])]) # an array, so that ids==i compares element-wise
out=[]
for i in set(ids):
    ind=np.where(ids==i)
    out.extend(list(labels[ind]/np.sum(labels[ind])))
out=np.array(out)
print(out)
Are there any NumPy functions that perform such a task? Any suggestion is appreciated!

I found this somewhat subtle way to transform labels into per-group sums, using indices = [n[0] for n in data] as the group ids. After that first step, data is no longer needed:
indices = [n[0] for n in data]
u, inv = np.unique(indices, return_inverse=True)
bincnt = np.bincount(inv, weights=labels)
sums = bincnt[inv]
Now sums is array([10., 10., 20., 20., 20.]). The rest is simple:
normalized_labels = labels / sums
Remarks. np.bincount calculates weighted sums of items labeled 0, 1, 2, ... This is why the reindexing indices -> inv is needed. For example, indices = [8, 6, 4, 3, 4, 6, 8, 8] should be mapped into inv = [3, 2, 1, 0, 1, 2, 3, 3].
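The reindexing in that remark can be checked directly. In this short sketch the weights array is made up purely for illustration; only the indices come from the remark above.

```python
import numpy as np

indices = np.array([8, 6, 4, 3, 4, 6, 8, 8])
weights = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])  # made-up values

# u holds the sorted distinct ids, inv maps each id to its position in u
u, inv = np.unique(indices, return_inverse=True)
group_sums = np.bincount(inv, weights=weights)   # one sum per distinct id
normalized = weights / group_sums[inv]           # broadcast sums back to the rows

print(u)           # [3 4 6 8]
print(inv)         # [3 2 1 0 1 2 3 3]
print(group_sums)  # [ 4.  8.  8. 16.]
```

Calling np.bincount on the raw ids would also work here, but it would allocate one slot for every integer up to the largest id, which is wasteful when the ids are large or sparse.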

Sum up data on specific (multiple) ranges

I'm certain there's a good way to do this but I'm blanking on the right search terms to google, so I'll ask here instead. My problem is this:
I have two 2-dimensional arrays, both with the same dimensions. One array (array 1) is the accumulated precipitation at (x,y) points. The other (array 2) is the topographic height of the same (x,y) grid. I want to sum up array 1 between specific heights of array 2, and create a bar graph with topographic height bins as the x-axis and total accumulated precipitation on the y-axis.
So I want to be able to declare a list of heights (say [0, 100, 200, ..., 1000]) and for each bin, sum up all precipitation that occurred within that bin.
I can think of a few complicated ways to do this, but I'm guessing there's probably an easier way that I'm not thinking of. My gut instinct is to loop through my list of heights, mask anything outside of that range, sum up remaining values, add those to a new array, and repeat.
I'm wondering if there's a built-in function in numpy or a similar library that can do this more efficiently.

This code shows what you're asking for, some explanation in comments:
import numpy as np
def in_range(x, lower_bound, upper_bound):
    # returns whether x is between lower_bound (inclusive) and upper_bound (exclusive)
    return x in range(lower_bound, upper_bound)
# vectorize allows you to easily 'map' the function to a numpy array
vin_range = np.vectorize(in_range)
# representing your rainfall
rainfall = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# representing your height map
height = np.array([[1, 2, 1], [2, 4, 2], [3, 6, 3]])
# the bands of height you're looking to sum
bands = [[0, 2], [2, 4], [4, 6], [6, 8]]
# computing the actual results you'd want to chart
result = [(band, sum(rainfall[vin_range(height, *band)])) for band in bands]
print(result)
The next to last line is where the magic happens. vin_range(height, *band) uses the vectorized function to create a numpy array of boolean values, with the same dimensions as height, that has True if a value of height is in the range given, or False otherwise.
By using that array to index the array with the target values (rainfall), you get an array that only has the values for which the height is in the target range. Then it's just a matter of summing those.
In more steps than result = [(band, sum(rainfall[vin_range(height, *band)])) for band in bands] (but with the same result):
result = []
for lower, upper in bands:
    include = vin_range(height, lower, upper)
    values_to_include = rainfall[include]
    sum_of_rainfall = sum(values_to_include)
    result.append(([lower, upper], sum_of_rainfall))
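np.vectorize calls the Python function once per element, so on large grids it can be slow. A possible variation, shown here only as a sketch, builds the same boolean mask with plain array comparisons, which also works for non-integer heights. It reuses the rainfall, height and bands from the code above.

```python
import numpy as np

rainfall = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
height = np.array([[1, 2, 1], [2, 4, 2], [3, 6, 3]])
bands = [[0, 2], [2, 4], [4, 6], [6, 8]]

result = []
for lower, upper in bands:
    # True where lower <= height < upper, same shape as height
    include = (height >= lower) & (height < upper)
    result.append(([lower, upper], int(rainfall[include].sum())))

print(result)
# [([0, 2], 4), ([2, 4], 28), ([4, 6], 5), ([6, 8], 8)]
```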

You can use np.bincount together with np.digitize. digitize creates an array of bin indices from the height array height and the bin boundaries bins. bincount then uses the bin indices to sum the data in array rain.
# set up
rain = np.random.randint(0,100,(5,5))/10
height = np.random.randint(0,10000,(5,5))/10
bins = [0,250,500,750,10000]
# compute
sums = np.bincount(np.digitize(height.ravel(),bins),rain.ravel(),len(bins)+1)
# result
sums
# array([ 0. , 37. , 35.6, 14.6, 22.4, 0. ])
# check against direct method
[rain[(height>=bins[i]) & (height<bins[i+1])].sum() for i in range(len(bins)-1)]
# [37.0, 35.6, 14.600000000000001, 22.4]
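Because the setup above is random, the numbers change on every run. The sketch below repeats the same idea on small fixed arrays (made up for illustration) and also shows np.histogram with its weights argument as an alternative that returns only the in-range bins.

```python
import numpy as np

rain = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=float)
height = np.array([[1, 2, 1], [2, 4, 2], [3, 6, 3]])
bins = [0, 2, 4, 6, 8]

# digitize returns 0 for values below bins[0] and len(bins) for values
# at or above bins[-1], hence the len(bins) + 1 slots
idx = np.digitize(height.ravel(), bins)
sums = np.bincount(idx, weights=rain.ravel(), minlength=len(bins) + 1)
in_range = sums[1:-1]            # drop the underflow and overflow slots

# alternative: histogram of heights, weighted by rain
hist, _ = np.histogram(height, bins=bins, weights=rain)

print(sums)      # [ 0.  4. 28.  5.  8.  0.]
print(in_range)  # [ 4. 28.  5.  8.]
print(hist)      # [ 4. 28.  5.  8.]
```

One difference to keep in mind: np.histogram treats the last bin as closed on the right, so a height exactly equal to bins[-1] is counted there, whereas digitize puts it in the overflow slot.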

Here is an example using the numpy.ma module, which lets you create masked arrays. From the docs:
A masked array is the combination of a standard numpy.ndarray and a mask. A mask is either nomask, indicating that no value of the associated array is invalid, or an array of booleans that determines for each element of the associated array whether the value is valid or not.
This seems to be what you need in this case.
import numpy as np
pr = np.random.randint(0, 1000, size=(100, 100)) #precipitation map
he = np.random.randint(0, 1000, size=(100, 100)) #height map
bins = np.arange(0, 1001, 200)
values = []
for vmin, vmax in zip(bins[:-1], bins[1:]):
    #creating the masked array, here minimum included inside bin, maximum excluded.
    maskedpr = np.ma.masked_where((he < vmin) | (he >= vmax), pr)
    values.append(maskedpr.sum())
values is the list of values for each bin, which you can plot.
The numpy.ma.masked_where function returns an array masked where condition is True. So you need to set the condition to be True outside the bins.
The sum() method performs the sum only where the array is not masked.

Extract sub rows with varying sizes from a big 2D NumPy Array

I have a NumPy array of size, say, 3*10, and I would like to extract sub-rows of varying sizes from each row. Each sub-row is centered on the middle pixel of its row, with a width that varies from row to row. Then I take the average of each sub-row. I have a pseudo example below:
import numpy as np
arr = np.arange(1,31).reshape((3,10))
pixel_size = np.array([2,3,1])
## each sub-row is centered on the middle of the row, index 5
## slice for each row: arr[0, 5-2:5+2], arr[1, 5-3:5+3], arr[2, 5-1:5+1]
submatrix = [[4,5,6,7],[13,14,15,16,17,18],[25,26]]
## output is the average of each row in the submatrix
output = [5.5,15.5,25.5]
If I have over 10 million rows, how can I do this quickly?

You can do it using list comprehensions and index slicing:
import numpy as np
arr = np.arange(1,31).reshape((3,10))
pixel_size = np.array([2,3,1])
middle_ind = int(arr.shape[1]/2.)
print(middle_ind)
sub_arr = [arr[i,middle_ind - pixel_size[i]:middle_ind + pixel_size[i]] for i in range(len(pixel_size))]
print('sub_arr: ', sub_arr)
output = [np.mean(item) for item in sub_arr]
print('output: ', output)
> sub_arr: [array([4, 5, 6, 7]), array([13, 14, 15, 16, 17, 18]), array([25, 26])]
> output: [5.5, 15.5, 25.5]
Your submatrix is a list not an array so it's more difficult to vectorize operations. You might want to think about restructuring your code to take advantage of matrix operations.
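One way to express this with array operations, sketched here under the assumption that every window stays inside its row, is to take a cumulative sum along each row and subtract the prefix sums at the two window edges. The names csum, lo and hi are illustrative only.

```python
import numpy as np

arr = np.arange(1, 31).reshape((3, 10))
pixel_size = np.array([2, 3, 1])
middle_ind = arr.shape[1] // 2

# csum[r, k] is the sum of arr[r, :k]; the leading zero column makes k = 0 valid
zeros = np.zeros((arr.shape[0], 1), dtype=arr.dtype)
csum = np.concatenate([zeros, np.cumsum(arr, axis=1)], axis=1)

rows = np.arange(arr.shape[0])
lo = middle_ind - pixel_size          # inclusive start of each window
hi = middle_ind + pixel_size          # exclusive end of each window
output = (csum[rows, hi] - csum[rows, lo]) / (hi - lo)

print(output)  # [ 5.5 15.5 25.5]
```

This avoids building the ragged list of sub-rows altogether, so the work stays inside NumPy even for millions of rows. For floating-point data, summing via differences of prefix sums can lose a little precision compared with summing each window directly.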

How to replicate numpy.choose() in tensorflow?

I'm trying to efficiently replicate numpy's ndarray.choose() method.
Here's a numpy example of what I'm looking for:
b = np.arange(15).reshape(3, 5)
c = np.array([1,0,4])
c.choose(b.T) # trying to replicate in tensorflow
-> array([ 1, 5, 14])
The best I've been able to do with this is generate a batch_size square matrix (which is huge if batch size is huge) and take the diagonal of it:
tf_b = tf.constant(b)
tf_c = tf.constant(c)
sess.run(tf.diag_part(tf.gather(tf.transpose(tf_b), tf_c)))
-> array([ 1, 5, 14])
Is there a way to do this that is just linear in the first dimension (instead of squared)?

Yeah, there's an easier way to do this. Flatten your b array to 1-d, so it's [0, 1, 2, ..., 13, 14]. Take an array of indices that are in the range of the number of 'choices' you are taking (3 in your case). That will be [0, 1, 2]. Multiply this range by the second dimension of your original shape, which is the number of options for each choice (5 in your case). That gives you [0, 5, 10]. Then add your indices to this to obtain [1, 5, 14]. Now you're good to call tf.gather().
Here is some code that I've taken from here that does a similar thing for RNN outputs. Yours will be slightly different, but the idea is the same.
index = tf.range(0, batch_size) * max_length + (length - 1)
flat = tf.reshape(output, [-1, out_size])
relevant = tf.gather(flat, index)
return relevant
In the big picture, the operation is pretty straightforward. You use the range operation to get the index of the beginning of each row, then add the index of where you are in each row. I think doing it in 1D is easiest, so that's why we flatten it.
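The index arithmetic can be checked in plain NumPy with the b and c from the question. This is only an illustration of the flattening idea; in TensorFlow the same three steps would map onto tf.range, tf.reshape and tf.gather, as in the snippet above.

```python
import numpy as np

b = np.arange(15).reshape(3, 5)
c = np.array([1, 0, 4])

n_rows, n_cols = b.shape
# start of each row in the flattened array, plus the chosen column
flat_index = np.arange(n_rows) * n_cols + c
picked = b.reshape(-1)[flat_index]

print(flat_index)  # [ 1  5 14]
print(picked)      # [ 1  5 14]
```

The gather touches one element per row, so the cost is linear in the first dimension instead of building a batch_size by batch_size intermediate.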
