how to get a numpy array from frequency and indices - python

I have a numpy array like this:
nparr = np.asarray([[u'fals', u'nazi', u'increas', u'technolog', u'equip', u'princeton',
u'realiti', u'civilian', u'credit', u'ten'],
[u'million', u'thousand', u'nazi', u'stick', u'visibl', u'realiti',
u'west', u'singl', u'jack', u'charl']])
What I need to do is to calculate the frequency of each item, and have another numpy array with the corresponding frequency of each item in the same position.
Here my array has shape (2, 10), so I need a numpy array of shape (2, 10) holding the frequency values. Thus, the output for the above would be:
[[1, 2, 1, 1, 1, 1, 2, 1, 1, 1]
[1, 1, 2, 1, 1, 2, 1, 1, 1, 1]]
What I have done so far:
unique, indices, count = np.unique(nparr, return_index=True, return_counts=True)
But this way count holds the frequency of each unique value, so it does not have the same shape as the original array.

You need to use return_inverse rather than return_index:
_, i, c = np.unique(nparr, return_inverse=True, return_counts=True)
_ is a convention to denote discarded return values. You don't need the unique values to know where the counts go.
You can get the counts arranged in the order of the original array with a simple indexing operation. Unraveling to the original shape is necessary, of course:
c[i].reshape(nparr.shape)
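Put together, the steps above might look like the following minimal sketch. The names inverse, counts and freq are illustrative choices, not part of the original answer.

```python
import numpy as np

nparr = np.asarray([['fals', 'nazi', 'increas', 'technolog', 'equip', 'princeton',
                     'realiti', 'civilian', 'credit', 'ten'],
                    ['million', 'thousand', 'nazi', 'stick', 'visibl', 'realiti',
                     'west', 'singl', 'jack', 'charl']])

# inverse maps every element to its slot in the sorted unique values;
# counts holds how often each unique value occurs.
_, inverse, counts = np.unique(nparr, return_inverse=True, return_counts=True)

# ravel() keeps this working whether inverse comes back flat or shaped like the input.
freq = counts[inverse.ravel()].reshape(nparr.shape)
print(freq)
```

The result has the same shape as nparr, with each word replaced by its overall count.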

Related

How can I index each occurrence of a max value along a given axis of a numpy array?

Suppose I have the following numpy array.
Q = np.array([[0,1,1],[1,0,1],[0,2,0]])
Question: How do I identify the position of each max value along axis 1? So the desired output would be something like:
array([[1,2],[0,2],[1]]) # The dtype of the output is not required to be a np array.
With np.argmax I can identify the first occurrence of the maximum along the axis, but not the subsequent values.
In: np.argmax(Q, axis =1)
Out: array([1, 0, 1])
I've also seen answers that rely on using np.argwhere that use a term like this.
np.argwhere(Q == np.amax(Q))
This will also not work here because I can't limit argwhere to work along a single axis. I also can't just flatten out the np array to a single axis because the max's in each row will differ. I need to identify each instance of the max of each row.
Is there a pythonic way to achieve this without looping through each row of the entire array, or is there a function analogous to np.argwhere that accepts an axis argument?
Any insight would be appreciated thanks!
Try with np.where:
np.where(Q == Q.max(axis=1)[:,None])
Output:
(array([0, 0, 1, 1, 2]), array([1, 2, 0, 2, 1]))
Not quite the output you want, but contains equivalent information.
You can also use np.argwhere which gives you the zip data:
np.argwhere(Q==Q.max(axis=1)[:,None])
Output:
array([[0, 1],
[0, 2],
[1, 0],
[1, 2],
[2, 1]])
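If the per-row list from the question is wanted, the flat np.where result can be split wherever the row index changes. This is a sketch that assumes every row has at least one maximum (always true for a non-empty row); split_points and per_row are illustrative names.

```python
import numpy as np

Q = np.array([[0, 1, 1], [1, 0, 1], [0, 2, 0]])

# Row and column indices of every element equal to its row's maximum.
rows, cols = np.where(Q == Q.max(axis=1)[:, None])

# rows is sorted, so a nonzero difference marks the start of a new row.
split_points = np.flatnonzero(np.diff(rows)) + 1
per_row = [c.tolist() for c in np.split(cols, split_points)]
print(per_row)  # [[1, 2], [0, 2], [1]]
```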

How can I find indexes with same elements in 2d numpy array?

I'm working on a machine vision project. By reflecting laser light onto the picture, I detect the pixels that the laser light falls on with the help of OpenCV. I keep these pixel values as a 2D numpy array. However, I want to make the x, y values unique by finding the pixel values whose x values are the same and taking the average of them. Pixel values are kept sequentially in the numpy array.
For example:
[[659 253]
[660 253]
[660 256]
[661 253]
[662 253]
[663 253]
[664 253]
[665 253]]
First of all, my goal is to identify all lists whose first element is the same. When using OpenCV, pixel values are kept in numpy arrays to be more useful. I'm trying to write an indexing method myself. I created a numpy array for myself to make it simpler.
x = np.array([[1, 2], [1, 78], [1, 3], [1, 6], [4, 3], [5, 6], [5, 3]], np.int32)
I followed a method like this to find the values whose first element is the same from the lists in the x array.
for i in range(len(x)):
    if x[i] != x[-1] and x[i][0] == x[i + 1][0]:
        print(x[i], x[i + 1])
I want to check whether the first element of each list matches the first element of the next list while walking through the x array. To avoid an index out of range error, I used x[i] != x[-1]. I was expecting this loop to return the result below.
[1,2] [1,78]
[1,78] [1,3]
[1,3] [1,6]
[5,6] [5,3]
I would later remove duplicate elements from the list but I got
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
I am not familiar with numpy arrays so I could not get the solution I wanted. Is it possible to get the result I want using numpy array methods? Thanks for your time.
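For context, the error arises because x[i] != x[-1] compares two rows element by element and yields a boolean array, which `if` cannot reduce to a single truth value. A minimal sketch of the loop that avoids the comparison by stopping one row early (the pairs list is an illustrative name):

```python
import numpy as np

x = np.array([[1, 2], [1, 78], [1, 3], [1, 6], [4, 3], [5, 6], [5, 3]], np.int32)

# Comparing two rows gives one boolean per element, not a single bool.
print(x[0] != x[-1])

pairs = []
for i in range(len(x) - 1):      # stop before the last row, so x[i + 1] always exists
    if x[i, 0] == x[i + 1, 0]:   # compare scalars, which is unambiguous
        pairs.append((x[i].tolist(), x[i + 1].tolist()))

for a, b in pairs:
    print(a, b)
```

This yields the four adjacent pairs listed in the question.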
Approach 1
This is a numpy way to do this:
x_sorted = x[np.argsort(x[:,0])]
marker_idx = np.flatnonzero(np.diff(x_sorted[:,0]))+1
output = np.split(x_sorted, marker_idx)
Approach 2
You can also use a package numpy_indexed which is designed to solve groupby problems with less script and without loss of performance:
import numpy_indexed as npi
npi.group_by(x[:, 0]).split(x)
Approach 3
You can get groups of indices but this might not be the best option because of list comprehension:
import pandas as pd
[x[idx] for idx in pd.DataFrame(x).groupby([0]).indices.values()]
Output (from a variant of x that also contains the row [1, 234])
[array([[ 1, 2],
[ 1, 78],
[ 1, 3],
[ 1, 6],
[ 1, 234]]),
array([[4, 3]]),
array([[5, 6],
[5, 3]])]
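Try the following, using itertools.groupby. Sort whole rows by their first element first, since groupby only groups consecutive items:
import itertools
x_sorted = x[np.argsort(x[:, 0], kind='stable')]
for _, group in itertools.groupby(x_sorted, key=lambda row: row[0]):
    print([tuple(p.tolist()) for p in group])
Output for the x defined in the question:
[(1, 2), (1, 78), (1, 3), (1, 6)]
[(4, 3)]
[(5, 6), (5, 3)]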
You can use np.unique with its return_inverse argument, which maps every row to the position of its key among the sorted unique keys, and return_counts, which helps build the split points:
_, ind, cnt = np.unique(x[:, 0], return_inverse=True, return_counts=True)
The inverse index arranges the unique values into x. To go the other way and bring rows with equal keys together, argsort the inverse index:
ind = np.argsort(ind)
To get the split points of the data, you can use np.cumsum on the counts. You don't need the last element because it always marks the end of the array:
spp = np.cumsum(cnt[:-1])
Finally, you can use np.split to get the list of sub-arrays that you want:
result = np.split(x[ind, :], spp, axis=0)
TL;DR
_, ind, cnt = np.unique(x[:, 0], return_inverse=True, return_counts=True)
np.split(x[np.argsort(ind), :], np.cumsum(cnt[:-1]), axis=0)
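The question's stated end goal was to average the pixels that share an x value. One possible sketch, assuming only the mean y per unique x is needed, skips the split entirely and combines np.unique with np.bincount. The pixels array below is the sample from the question; mean_y and averaged are illustrative names.

```python
import numpy as np

pixels = np.array([[659, 253], [660, 253], [660, 256], [661, 253],
                   [662, 253], [663, 253], [664, 253], [665, 253]])

# xs: sorted unique x values; inv: group number of each row; cnt: rows per group.
xs, inv, cnt = np.unique(pixels[:, 0], return_inverse=True, return_counts=True)

# Sum the y values per group, then divide by the group size.
mean_y = np.bincount(inv, weights=pixels[:, 1]) / cnt

averaged = np.column_stack((xs, mean_y))
print(averaged)
```

Here the two rows with x = 660 collapse into a single row with y = 254.5.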

Numpy array normalization by group ids:

Suppose data and labels are numpy arrays as below:
import numpy as np
data=np.array([[0,4,5,6,8],[0,6,8,9],[1,9,5],[1,45,7],[1,8,3]], dtype=object) #Note: length of each row is different, so dtype=object is required
labels=np.array([4,6,10,4,6])
The first element of each row in data is the id of a group. I want to normalize the labels within each group (see the example below).
For example, the first two rows in data have id=0; thus, their labels must be:
normalized_labels[0]=labels[0]/(4+6)=0.4
normalized_labels[1]=labels[1]/(4+6)=0.6
The expected output should be:
normalized_labels=[0.4,0.6,0.5,0.2,0.3]
I have a naive solution as:
ids=np.array([row[0] for row in data])  # an array, so that ids==i compares elementwise
out=[]
for i in np.unique(ids):
    ind=np.where(ids==i)
    out.extend(labels[ind]/np.sum(labels[ind]))
out=np.array(out)
print(out)
Is there a numpy function that performs such a task? Any suggestion is appreciated!
Here is a somewhat subtle way to turn labels into per-group sums, using indices = [n[0] for n in data]. Once indices has been extracted, data is no longer needed:
indices = [n[0] for n in data]
u, inv = np.unique(indices, return_inverse=True)
bincnt = np.bincount(inv, weights=labels)
sums = bincnt[inv]
Now sums is array([10., 10., 20., 20., 20.]). The rest is simple:
normalized_labels = labels / sums
Remarks. np.bincount computes weighted sums for items labeled 0, 1, 2, ... This is why the re-indexing indices -> inv is needed. For example, indices = [8, 6, 4, 3, 4, 6, 8, 8] should be mapped into inv = [3, 2, 1, 0, 1, 2, 3, 3].
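As a self-contained sketch of the approach above, using the question's data. The ragged rows are kept as plain Python lists here (an assumption made to sidestep ragged-array creation), since only their first elements are used.

```python
import numpy as np

rows = [[0, 4, 5, 6, 8], [0, 6, 8, 9], [1, 9, 5], [1, 45, 7], [1, 8, 3]]
labels = np.array([4, 6, 10, 4, 6])

# Group id of each row, re-indexed to 0, 1, 2, ... by return_inverse.
indices = np.array([r[0] for r in rows])
_, inv = np.unique(indices, return_inverse=True)

# Per-group label sums, broadcast back to one value per row.
sums = np.bincount(inv, weights=labels)[inv]

normalized_labels = labels / sums
print(normalized_labels)
```

Unlike the loop in the question, this keeps the output in the original row order even when rows of the same group are not adjacent.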

Python zip method and rounding behaviour

Just seen what seems a rather curious behaviour in the python zip() built-in. I passed it a Numpy array of rounded decimals but it spits out an expanded version.
This is the original array; my goal is to generate a dictionary with the proportion of occupancy of each unique element. np is Numpy.
a = np.array([1, 2, 3, 1, 1, 2, 1])
So I go doing
elems, counts = np.unique(a, return_counts=True)
which spits (array([1, 2, 3]), array([4, 2, 1])). Correct. But now I want the proportion, not the count (rounded to the third digit), so I do
counts = np.round(counts/a.size, 3)
which gives array([ 0.571, 0.286, 0.143]) for counts. Now into zipping this into the sought dict:
dict(zip(*(elems, counts)))
This spits {1: 0.57099999999999995, 2: 0.28599999999999998, 3: 0.14299999999999999}, so looks like the rounded counts have seen some digits added!
Numpy just displays a different number of significant digits when printing numpy arrays; zip does not change the values. You can adjust the printing precision with np.set_printoptions.
Example using your data:
import numpy as np
a = np.array([1, 2, 3, 1, 1, 2, 1])
elems, counts = np.unique(a, return_counts=True)
counts = np.round(counts/a.size, 3)
np.set_printoptions(precision=20)
print(counts)
outputs:
[ 0.57099999999999995204 0.28599999999999997646 0.14299999999999998823]
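To see that only the display differs, the same stored value can be formatted with different precisions. The sketch below also converts the arrays with tolist() before building the dict, which yields plain Python numbers; how the dict is printed otherwise depends on the NumPy version. The names props and result are illustrative.

```python
import numpy as np

a = np.array([1, 2, 3, 1, 1, 2, 1])
elems, counts = np.unique(a, return_counts=True)
props = np.round(counts / a.size, 3)

# tolist() converts NumPy scalars to plain Python ints and floats.
result = dict(zip(elems.tolist(), props.tolist()))
print(result)

# The same float shown with 3 and with 17 decimal places.
short = format(result[1], '.3f')
long = format(result[1], '.17f')
print(short, long)
```

Both strings describe the same binary floating-point number; 0.571 has no exact binary representation, so extra digits appear whenever more precision is requested.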

Gather elements along second dimension of tensor

Assume values and tensor T both have shape (N,K). Thinking of them as matrices, I would like to get, for each row of T, the element at the index where the corresponding row of values has its maximum. I can easily find those indices with
max_indices = tf.argmax(values, 1)
which returns a tensor of shape (N). Now, how can I gather up these indices from T such that I get something of shape N? I tried
result = tf.gather(T,max_indices)
but it doesn't do the right thing - it returns something of shape (N,K) which means that it didn't gather up anything.
You can use tf.gather_nd.
For example,
import tensorflow as tf
sess = tf.InteractiveSession()
values = tf.constant([[0, 0, 0, 1],
                      [0, 1, 0, 0],
                      [0, 0, 1, 0]])
T = tf.constant([[0, 1, 2,  3],
                 [4, 5, 6,  7],
                 [8, 9, 10, 11]])
max_indices = tf.argmax(values, axis=1)
# If T.get_shape()[0] is None, you can replace it with tf.shape(T)[0].
result = tf.gather_nd(T, tf.stack((tf.range(T.get_shape()[0],
                                            dtype=max_indices.dtype),
                                   max_indices),
                                  axis=1))
print(result.eval())
However when the ranks of values and T are higher, the use of tf.gather_nd will be a little awkward. I posted my current solution on this question. There might be a better solution in case of high dimensional values and T.
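For comparison, here is the same row-wise gather written in NumPy rather than TensorFlow. This is an analogue for illustrating the indexing pattern, not the TensorFlow API itself: pairing a row range with the per-row column indices mirrors the (row, column) pairs that tf.gather_nd receives above.

```python
import numpy as np

values = np.array([[0, 0, 0, 1],
                   [0, 1, 0, 0],
                   [0, 0, 1, 0]])
T = np.array([[0, 1, 2, 3],
              [4, 5, 6, 7],
              [8, 9, 10, 11]])

max_indices = values.argmax(axis=1)

# Pair each row number with its column index, like the stacked indices for gather_nd.
result = T[np.arange(T.shape[0]), max_indices]

# Equivalent formulation that generalizes more easily to a chosen axis.
alt = np.take_along_axis(T, max_indices[:, None], axis=1).squeeze(axis=1)
print(result)  # [ 3  5 10]
```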
