Suppose data and labels are numpy arrays as below:
import numpy as np
data = np.array([[0,4,5,6,8],[0,6,8,9],[1,9,5],[1,45,7],[1,8,3]], dtype=object)  # note: rows have different lengths, so this is an object array
labels=np.array([4,6,10,4,6])
The first element in each row of data is a group id. I want to normalize the labels (see the example below) based on the group ids:
For example, the first two rows in data have id=0; thus, their labels must be:
normalized_labels[0]=labels[0]/(4+6)=0.4
normalized_labels[1]=labels[1]/(4+6)=0.6
The expected output should be:
normalized_labels=[0.4,0.6,0.5,0.2,0.3]
I have a naive solution as:
ids = np.array([data[i][0] for i in range(data.shape[0])])  # group id of each row
out = []
for i in set(ids):
    ind = np.where(ids == i)
    out.extend(list(labels[ind] / np.sum(labels[ind])))
out = np.array(out)
print(out)
Are there any numpy functions to perform such a task? Any suggestion is appreciated!
I found this somewhat subtle way to transform labels into per-group sums with respect to indices = [n[0] for n in data]. After extracting the indices, data is not needed any further:
indices = [n[0] for n in data]                     # group id of each row
u, inv = np.unique(indices, return_inverse=True)   # inv re-labels the ids as 0, 1, 2, ...
bincnt = np.bincount(inv, weights=labels)          # per-group sums of labels
sums = bincnt[inv]                                 # broadcast each group's sum back to its rows
Now sums is array([10., 10., 20., 20., 20.]). The rest is simple:
normalized_labels = labels / sums
Remarks: np.bincount calculates weighted sums of items labelled 0, 1, 2, ..., which is why the re-indexing indices -> inv is needed. For example, indices = [8, 6, 4, 3, 4, 6, 8, 8] is mapped to inv = [3, 2, 1, 0, 1, 2, 3, 3].
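To make the re-indexing concrete, here is a minimal sketch (my own illustration, not part of the answer) of how np.unique produces inv for that example:
import numpy as np
indices = np.array([8, 6, 4, 3, 4, 6, 8, 8])
u, inv = np.unique(indices, return_inverse=True)
print(u)    # [3 4 6 8]          -> sorted unique values
print(inv)  # [3 2 1 0 1 2 3 3]  -> position of each original value in u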
Say I have an array:
x = np.array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
And a multi-labeled mask:
labels = np.array([[0, 0, 2],
[1, 1, 2],
[1, 1, 2]])
My goal is to sum the entries of x together, grouped by labels. For example:
n_labels = np.max(labels) + 1
out = np.empty(n_labels)
for label in range(n_labels):
    mask = labels == label
    out[label] = np.sum(x[mask])
>>> out
array([ 1., 20., 15.])
However, as x and n_labels become large, this is inefficient: each iteration only sums a small fraction of the entries of x, yet it recomputes the mask over all of labels (in the expression labels == label) and then indexes over all of x (in the expression x[mask]). Is there a more efficient way to do this as x and n_labels grow large?
You can use bincount with weights:
np.bincount(labels.ravel(), weights=x.ravel())
out:
array([ 1., 20., 15.])
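One small practical note (my addition, not part of the answer): when weights are supplied, np.bincount returns a float64 array, so cast back if you want to keep the integer dtype of x:
out = np.bincount(labels.ravel(), weights=x.ravel()).astype(x.dtype)
out  # array([ 1, 20, 15])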
You don't really have a reason to operate on 2D arrays, so ravel them first:
labels = labels.ravel()
x = x.ravel()
If your labels are already indices, you can use np.argsort along with np.diff and np.add.reduceat:
index = labels.argsort()
splits = np.r_[0, np.flatnonzero(np.diff(labels[index])) + 1]
result = np.add.reduceat(x[index], splits)
labels[index] is the labels array in sorted order. Whenever it changes, you enter a new group, so the diff is nonzero there. That's what np.flatnonzero(np.diff(labels[index])) finds for you. Since each new run starts one position past where the diff is nonzero, you need to add one. np.r_ lets you easily prepend a zero to a 1D array, which is necessary for reduceat to include the first run (the last one is handled automatically).
Before you run reduceat, you need to order x into the runs defined by labels, which is what x[index] does.
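Putting the pieces together, here is a minimal runnable sketch of this approach on the example arrays (my own consolidation of the snippets above):
import numpy as np

x = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]]).ravel()
labels = np.array([[0, 0, 2], [1, 1, 2], [1, 1, 2]]).ravel()

index = labels.argsort()                                       # order elements by group
splits = np.r_[0, np.flatnonzero(np.diff(labels[index])) + 1]  # start index of each run
result = np.add.reduceat(x[index], splits)                     # per-group sums
print(result)  # [ 1 20 15]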
Another, admittedly slower and somewhat over-engineered, approach works directly on the 2D arrays using np.add.at:
sums = np.zeros(labels.max()+1, x.dtype)
np.add.at(sums, labels, x)
sums
Output
array([ 1, 20, 15])
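For context (a hedged aside, not from the original answer): np.add.at is used instead of plain fancy-indexed assignment because sums[labels] += x applies only one update per repeated label, while the unbuffered ufunc accumulates them all:
import numpy as np

x = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
labels = np.array([[0, 0, 2], [1, 1, 2], [1, 1, 2]])

buffered = np.zeros(labels.max() + 1, x.dtype)
buffered[labels] += x          # repeated labels overwrite each other
print(buffered)                # not the grouped sums

sums = np.zeros(labels.max() + 1, x.dtype)
np.add.at(sums, labels, x)     # unbuffered: every (label, value) pair accumulates
print(sums)                    # [ 1 20 15]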
I have two 1D arrays of integers whose sums differ, for example:
a = [1,2,2,0,3,5]
b = [0,0,3,2,0,0]
I would like the sum of each array to be equal to that of the smaller of the two. However, I want to keep the values as integers, not floats, so dividing is not an option. The solution appears to be some subsampling of the larger array so that its sum equals that of the smaller one:
target = min(sum(a), sum(b))
However, I cannot find a function that would perform such subsampling. The only ones I found are in scipy, but they seem dedicated to audio-signal processing. The alternative was a function from the scikit-bio package, but it does not work on Python 3.7.
You could convert the array to indices, sample the indices and convert back to values as follows:
import numpy as np
np.random.seed(0)

a = np.array([1,2,2,0,3,5])
target = 5  # min(sum(a), sum(b)) from the question

# Generate an array of indices; the values in "a"
# define the number of occurrences of each index
a_idx = np.repeat(np.arange(len(a)), a)
# [0, 1, 1, 2, 2, 4, 4, 4, 5, 5, 5, 5, 5]

# Randomly shuffle the indices and pick the first `target` of them
a_sub_idx = np.random.permutation(a_idx)[:target]
# [4, 1, 2, 2, 5]

# Count the number of occurrences of each sampled index
a_sub_idx, a_sub_vals = np.unique(a_sub_idx, return_counts=True)

# Scatter the counts back into an array with the original shape
a_sub = np.zeros(a.shape, dtype=a.dtype)
a_sub[a_sub_idx] = a_sub_vals
# [0, 1, 2, 0, 1, 1]
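As a quick sanity check (my addition), the subsample should sum to target and never exceed the original counts element-wise:
assert a_sub.sum() == target
assert np.all(a_sub <= a)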
I have a numpy array like this:
nparr = np.asarray([[u'fals', u'nazi', u'increas', u'technolog', u'equip', u'princeton',
u'realiti', u'civilian', u'credit', u'ten'],
[u'million', u'thousand', u'nazi', u'stick', u'visibl', u'realiti',
u'west', u'singl', u'jack', u'charl']])
What I need to do is to calculate the frequency of each item, and have another numpy array with the corresponding frequency of each item in the same position.
So here, since my array's shape is (2, 10), I need a numpy array of shape (2, 10) but with the frequency values. Thus, the output of the above would be:
[[1, 2, 1, 1, 1, 1, 2, 1, 1, 1]
[1, 1, 2, 1, 1, 2, 1, 1, 1, 1]]
What I have done so far:
unique, indices, count = np.unique(nparr, return_index=True, return_counts=True)
However, this way count holds the frequency of each unique value, and it does not give me the same shape as the original array.
You need to use return_inverse rather than return_index:
_, i, c = np.unique(nparr, return_inverse=True, return_counts=True)
_ is a convention to denote discarded return values. You don't need the unique values to know where the counts go.
You can get the counts arranged in the order of the original array with a simple indexing operation. Reshaping back to the original shape is necessary, of course:
c[i].reshape(nparr.shape)
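Putting it together, a minimal sketch with a smaller made-up array (note that in newer NumPy versions the inverse index may come back with the input's shape, so raveling it first keeps this version-proof):
import numpy as np

nparr = np.array([[u'fals', u'nazi', u'increas'],
                  [u'nazi', u'west', u'fals']])

_, i, c = np.unique(nparr, return_inverse=True, return_counts=True)
freq = c[np.ravel(i)].reshape(nparr.shape)
print(freq)
# [[2 2 1]
#  [2 1 2]]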
I am trying to implement K-means with selective centroid selection. I have two numpy arrays: one called "features", where each element is a datapoint, and another called "labels", where the element at index i is the class label of the datapoint at index i. The datapoints belong to 4 different classes. I want to use both arrays to randomly pick one datapoint from each class. Could you please help me with this? Also, is there a way to zip two numpy arrays into a dictionary?
For example, my features array is:
[[1,1,1],[1,2,3],[1,6,7],[1,4,6],[1,6,9],[1,4,2]]
and my labels array is
[1,2,2,3,1,3]
For each unique value in the labels array, I want one randomly chosen corresponding element from the features array. A sample answer would be:
[1,1,1] from class 1
[1,6,7] from class 2
[1,4,2] from class 3
Given this is the setup in your question:
import numpy as np
features = [[1,1,1],[1,2,3],[1,6,7],[1,4,6],[1,6,9],[1,4,2]]
labels = np.array([1,2,2,3,1,3])
This should get you a random variable from each label in dictionary form:
features_index = np.array(range(0, len(features)))
unique_labels = np.unique(labels)
rand = []
for n in unique_labels:
    rand.append(features[np.random.choice(features_index[labels == n])])
dict(zip(unique_labels, rand))
Try:
import numpy as np
features = np.array([[1,1,1],[1,2,3],[1,6,7],[1,4,6],[1,6,9],[1,4,2]])
labels = np.array([1,2,2,3,1,3])
res = {i: features[np.random.choice(np.where(labels == i)[0])] for i in set(labels)}
output
{1: array([1, 1, 1]), 2: array([1, 2, 3]), 3: array([1, 4, 2])}
You can accomplish this with a bit of indexing and numpy.unique, picking one random row index per label:
u = np.unique(labels)
f = np.arange(features.shape[0])
idx = np.array([np.random.choice(f[labels == lab]) for lab in u])
dict(zip(u, features[idx]))
{1: array([1, 6, 9]), 2: array([1, 2, 3]), 3: array([1, 4, 2])}  # example output; the picks are random
I want to compute a histogram of the differences between all the elements in one array A with all the elements in another array B.
So I want to have a histogram of the following data:
Delta1 = A1-B1
Delta2 = A1-B2
Delta3 = A1-B3
...
DeltaN = A2-B1
DeltaN+1 = A2-B2
DeltaN+2 = A2-B3
...
The point of this calculation is to show that these data are correlated, even though not every data point has a "partner" in the other array and the correlation is rather noisy in practice.
The problem is that these files are in practice very large, several GB, and all entries of the vectors are 64-bit integers with very large differences.
It seems infeasible to me to convert these data to binary arrays in order to be able to use correlation functions and Fourier transforms to compute this.
Here is a small example to give a better taste of what I'm looking at.
This implementation with numpy's searchsorted in a for loop is rather slow.
import numpy as np
import matplotlib.pyplot as plt
timetagsA = [668656283,974986989,1294941174,1364697327,\
1478796061,1525549542,1715828978,2080480431,2175456303,2921498771,3671218524,\
4186901001,4444689281,5087334517,5467644990,5836391057,6249837363,6368090967,8344821453,\
8933832044,9731229532]
timetagsB = [13455,1294941188,1715828990,2921498781,5087334530,5087334733,6368090978,9731229545,9731229800,9731249954]
max_delta_t = 500
nbins = 10000
histo=np.zeros((nbins,2), dtype = float)
histo[:,0]=np.arange(0,nbins)
for i in range(0, len(timetagsA)):
    delta_t = 0
    j = np.searchsorted(timetagsB, timetagsA[i])
    while np.round(delta_t) < max_delta_t and j < len(timetagsB):
        delta_t = timetagsB[j] - timetagsA[i]
        if delta_t < max_delta_t:
            histo[int(delta_t), 1] += 1
        j = j + 1
plt.plot(histo[0:50,1])
plt.show()
It would be great if someone could help me to find a faster way to compute this. Thanks in advance!
EDIT
The original solution below assumes that your data is so huge that you cannot simply use np.subtract.outer and then slice out the values you want to keep, which would otherwise be:
arr_diff = np.subtract.outer(arrB, arrA)
print (arr_diff[(0<arr_diff ) &(arr_diff <max_delta_t)])
# array([ 14, 12, 10, 13, 216, 11, 13, 268], dtype=int64)
With your example data this works, but not with a very large data set.
ORIGINAL SOLUTION
Let's first suppose your max_delta_t is smaller than the difference between any two successive values in timetagsB, for an easy way of doing it (then we can try to generalize it).
#create the array instead of list
arrA = np.array(timetagsA)
arrB = np.array(timetagsB)
max_delta_t = np.diff(arrB).min() - 1 #here it's 202 just for the explanation
You can use np.searchsorted in a vectorized way:
# create the array of search
arr_search = np.searchsorted(arrB, arrA) # the position of each element of arrA in arrB
print (arr_search)
# array([1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 6, 6, 6, 6, 7, 7, 7],dtype=int64)
You can calculate the difference between the element of arrB corresponding to each element of arrA by indexing arrB with arr_search:
# calculate the difference
arr_diff = arrB[arr_search] - arrA
print (arr_diff[arr_diff<max_delta_t]) # find the ones smaller than max_delta_t
# array([14, 12, 10, 13, 11, 13], dtype=int64)
What you are looking for is then calculated with np.bincount:
arr_bins = np.bincount(arr_diff[arr_diff<max_delta_t])
#to make it look like histo but not especially necessary
histo = np.array([range(len(arr_bins)),arr_bins]).T
Now the problem is that there are some difference values between arrA and arrB that cannot be obtained with this method when max_delta_t is bigger than the gap between two successive values in arrB. Here is one way to handle it, maybe not the most efficient depending on the values of your data, that works for any value of max_delta_t:
#need an array with the number of elements in arrB for each element of arrA
# within a max_delta_t range
arr_diff_search = np.searchsorted(arrB, arrA + max_delta_t)- np.searchsorted(arrB, arrA)
#do a loop to calculate all the values you are interested in
list_arr = []
for i in range(arr_diff_search.max()+1):
    arr_diff = arrB[(arr_search+i)%len(arrB)][(arr_diff_search>=i)] - arrA[(arr_diff_search>=i)]
    list_arr.append(arr_diff[(0<arr_diff)&(arr_diff<max_delta_t)])
Now you can np.concatenate list_arr and use np.bincount like so:
arr_bins = np.bincount(np.concatenate(list_arr))
histo = np.array([range(len(arr_bins)),arr_bins]).T
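As a quick sanity check (my own sketch, assuming the arrays are small enough for the outer product), the counts gathered this way should match the brute-force approach from the EDIT above:
brute = np.subtract.outer(arrB, arrA)
brute = brute[(0 < brute) & (brute < max_delta_t)]
assert np.array_equal(np.sort(np.concatenate(list_arr)), np.sort(brute))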