Replacing NumPy array entries with their frequencies / values from dictionary - python

Problem: From two input arrays, I want to output an array with the frequency of True values (from input_2) corresponding to each value of input_1.
import numpy as np
from scipy.stats import itemfreq
input_1 = np.array([3,6,6,3,6,4])
input_2 = np.array([False, True, True, False, False, True])
For this example output that I want is:
output_1 = np.array([0,2,2,0,2,1])
My current approach involves editing input_1, so only the values corresponding to True remain:
locs = np.where(input_2, input_1, 0)
Then counting the frequency of each answer, creating a dictionary and replacing the appropriate keys of input_1 to values (the True frequencies).
loc_freq = itemfreq(locs)
dic = {}
for key, val in loc_freq:
    dic[key] = val
print dic
for k, v in dic.iteritems():
    input_1[input_1 == k] = v
which outputs [3,2,2,3,2,1].
The problem here is twofold:
1) this still does not do anything with the keys that are not in the dictionary (and should therefore be changed to 0). For example, how can I get the 3s transformed into 0s?
2) This seems very inelegant / ineffective. Is there a better way to approach this?

np.bincount is what you are looking for.
output_1 = np.bincount(input_1[input_2])[input_1]
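A quick check with the arrays from the question (a sketch of why the one-liner works: input_1[input_2] keeps only the True-flagged values, bincount tallies them, and indexing by input_1 broadcasts the counts back onto every element):

```python
import numpy as np

input_1 = np.array([3, 6, 6, 3, 6, 4])
input_2 = np.array([False, True, True, False, False, True])

# count only the True-flagged values; counts[v] = number of times v was True
counts = np.bincount(input_1[input_2])
# look each element's count up
output_1 = counts[input_1]
print(output_1)  # → [0 2 2 0 2 1]
```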

@memecs's solution is correct, +1. However, it will be very slow and take a lot of memory if the values in input_1 are really large, i.e. if they are not indices into an array but, say, seconds or some other integer data that can take very large values.
In that case, np.bincount(input_1[input_2]).size equals the largest integer in input_1 with a True value in input_2.
It is much faster to combine unique and bincount. We use the first to extract the indices of the unique elements of input_1, and then use bincount to count how often those indices appear in that same array, weighting each occurrence 1 or 0 according to input_2 (True or False):
# extract unique elements and the indices needed to reconstruct the array
unq, idx = np.unique(input_1, return_inverse=True)
# count the indices, weighted by input_2 (True -> 1, False -> 0)
freqs_idx = np.bincount(idx, weights=input_2)
# map the per-unique-value counts back onto the original array
frequencies = freqs_idx[idx]
print(frequencies)  # [0. 2. 2. 0. 2. 1.]
This solution is really fast and has the minimum memory impact. Credit goes to @Jaime, see his comment below. Below I report my original answer, using unique in a different manner.
OTHER POSSIBILITY
It may be faster to go for another solution, using unique:
import numpy as np
input_1 = np.array([3, 6, 6, 3, 6, 4])
input_2 = np.array([False, True, True, False, False, True])
non_zero_hits, counts = np.unique(input_1[input_2], return_counts=True)
all_hits, idx = np.unique(input_1, return_inverse=True)
frequencies = np.zeros_like(all_hits)
#2nd step, with broadcasting
idx_non_zero_hits_in_all_hits = np.where(non_zero_hits[:, np.newaxis] - all_hits == 0)[1]
frequencies[idx_non_zero_hits_in_all_hits] = counts
print(frequencies[idx])
This has the drawback that it will require a lot of memory if the number of unique elements in input_1 with a True value in input_2 are many, because of the 2D array created and passed to where. To reduce the memory footprint, you could use a for loop instead for the 2nd step of the algorithm:
# 2nd step, but with a for loop
for j, val in enumerate(non_zero_hits):
    index = np.where(val == all_hits)[0]
    frequencies[index] = counts[j]
print(frequencies[idx])
This second solution has a very small memory footprint but requires a for loop. Which solution is best depends on your typical input sizes and values.

The currently accepted bincount solution is quite elegant, but the numpy_indexed package provides more general solutions to problems of this kind:
import numpy_indexed as npi
idx = npi.as_index(input_1)
unique_labels, true_count_per_label = npi.group_by(idx).sum(input_2)
print(true_count_per_label[idx.inverse])

Related

Finding 3D indices of all matching values in numpy

I have a 3D int64 Numpy array, which is output from skimage.measure.label. I need a list of 3D indices that match each of our possible (previously known) values, separated out by which indices correspond to each value.
Currently, we do this by the following idiom:
for cur_idx, count in values_counts.items():
    region = labels[:, :, :] == cur_idx
    [dim1_indices, dim2_indices, dim3_indices] = np.nonzero(region)
While this code works and produces correct output, it is quite slow, especially the np.nonzero part, as we call this 200+ times on a large array. I realize that there is probably a faster way to do this via, say, numba, but we'd like to avoid adding on additional requirements unless needed.
Ultimately, what we're looking for is a list of indices that correspond to each (nonzero) value relatively efficiently. Assume our number of values <1000 but our array size >100x1000x1000. So, for example, on the array created by the following:
x = np.zeros((4,4,4))
x[3,3,3] = 1; x[1,0,3] = 2; x[1,2,3] = 2
we would want some idx_value dict/array mapping each value to the indices where it occurs, e.g. idx_value[1] holding the indices where x == 1, idx_value[2] those where x == 2, and so on.
I've tried tackling problems similar to the one you describe, and I think the np.argwhere function is probably your best option for reducing runtime (see docs here). See the code example below for how this could be used per the constraints you identify above.
import numpy as np
x = np.zeros((4,4,4))
x[3,3,3] = 1; x[1,0,3] = 2; x[1,2,3] = 3
# Instantiate dictionary/array to store indices
idx_value = {}
# Get indices for each value
idx_value[3] = np.argwhere(x == 3)
idx_value[2] = np.argwhere(x == 2)
idx_value[1] = np.argwhere(x == 1)
# Display idx_value - consistent with indices we set before
>>> idx_value
{3: array([[1, 2, 3]]), 2: array([[1, 0, 3]]), 1: array([[3, 3, 3]])}
For the first use case, I think you would still have to use a for loop to iterate over the values you're searching over, but it could be done as:
# Instantiate dictionary/array
idx_value = {}
# Now loop by incrementally adding key/value pairs
for cur_idx, count in values_counts.items():
    idx_value[cur_idx] = np.argwhere(labels == cur_idx)
NOTE: This incrementally creates a dictionary where each key is an idx to search for, and each value is a np.array object of shape (N_matches, 3).

Finding mode in np.array 1d and get the first one

I want to find the mode of a numpy array and, when there are ties, get the first (leftmost) one.
For example, in [1,2,2,3,3,4] there are two modes (most frequently appearing values), 2 and 3.
In this case I want the leftmost one, 2.
There are examples of getting the mode with numpy, scipy, or statistics.
My array is numpy, so a numpy-only solution would be simplest.
How can I do this?
Have you had a look at collections.Counter?
import numpy as np
import collections
x = np.array([1, 2, 3, 2, 4, 3])
c = collections.Counter(x)
largest_num, count = c.most_common(1)[0]
The documentation of Counter.most_common states:
Elements with equal counts are ordered in the order first encountered
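With the tied array from the question, a quick sketch showing that 2 (first encountered) wins over 3 (this relies on Counter preserving insertion order, which holds on Python 3.7+):

```python
import collections

a = [1, 2, 2, 3, 3, 4]
c = collections.Counter(a)  # counts: {1: 1, 2: 2, 3: 2, 4: 1}
mode, count = c.most_common(1)[0]
print(mode, count)  # → 2 2
```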
If you want to use only numpy, you could try this, based on the scipy mode implementation
a = np.array([1, 2, 2, 3, 3, 4])
# Unique values
scores = set(np.ravel(a))
# Retrieve keys, counts
keys, counts = zip(*[(score, np.sum(a == score)) for score in scores])
# Key with the maximum count
keys[np.argmax(counts)]
>>> 2
where the argmax function states
"In case of multiple occurrences of the maximum values, the indices corresponding to the first occurrence are returned."
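One caveat: iterating a Python set does not guarantee first-encounter order, so with ties the argmax approach above is not guaranteed to return the leftmost mode. A numpy-only sketch that does guarantee it, using np.unique's return_index to break ties by earliest first occurrence:

```python
import numpy as np

a = np.array([1, 2, 2, 3, 3, 4])
# unique values, the position where each first occurs, and their counts
vals, first_pos, counts = np.unique(a, return_index=True, return_counts=True)
tied = counts == counts.max()                   # all values sharing the max count
mode = vals[tied][np.argmin(first_pos[tied])]   # earliest first occurrence wins
print(mode)  # → 2
```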

Python compute a specific inner product on vectors

Assume we have two arrays of shapes (m, 6) and (n, 6):
import numpy as np
a = np.random.random((m, 6))
b = np.random.random((n, 6))
Using np.inner works as expected and yields
np.inner(a, b).shape
(m, n)
with every element being the scalar product of each combination. I now want to compute a special inner product (namely the Plücker product). Right now I'm using
def pluckerSide(a, b):
    a0, a1, a2, a3, a4, a5 = a
    b0, b1, b2, b3, b4, b5 = b
    return a0*b4 + a1*b5 + a2*b3 + a4*b0 + a5*b1 + a3*b2
with a and b sliced by a for loop, which is way too slow. Every attempt at vectorizing fails, mostly with broadcast errors due to wrong shapes, and I can't get np.vectorize to work either.
Maybe someone can help here?
The function pluckerSide performs pairwise multiplication over a fixed permutation of the indices of the two input arrays, followed by a sum. So, I would list out those indices, index into the arrays with them, and finally use matrix multiplication with np.dot to perform the sum-reduction.
Thus, one approach would be like this -
a_idx = np.array([0,1,2,4,5,3])
b_idx = np.array([4,5,3,0,1,2])
out = a[a_idx].dot(b[b_idx])
If you are doing this in a loop across all rows of a and b and thus generating an output array of shape (m,n), we can vectorize that, like so -
out_all = a[:,a_idx].dot(b[:,b_idx].T)
To make things a bit easier, we can re-arrange a_idx such that it becomes range(6) and re-arrange b_idx with that pattern. So, we would have :
a_idx = np.array([0,1,2,3,4,5])
b_idx = np.array([4,5,3,2,0,1])
Thus, we can skip indexing into a and the solution would be simply -
a.dot(b[:,b_idx].T)
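To sanity-check the reordered-index trick, a sketch comparing the vectorized product against the explicit double loop (shapes and seed here are arbitrary):

```python
import numpy as np

def pluckerSide(a, b):
    a0, a1, a2, a3, a4, a5 = a
    b0, b1, b2, b3, b4, b5 = b
    return a0*b4 + a1*b5 + a2*b3 + a4*b0 + a5*b1 + a3*b2

rng = np.random.default_rng(0)
a = rng.random((4, 6))
b = rng.random((5, 6))

b_idx = np.array([4, 5, 3, 2, 0, 1])   # b_idx rearranged so a_idx becomes range(6)
out_all = a.dot(b[:, b_idx].T)         # shape (4, 5)

# explicit double loop as the reference
expected = np.array([[pluckerSide(ai, bj) for bj in b] for ai in a])
print(np.allclose(out_all, expected))  # → True
```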

location of array of values in numpy array

Here is a small code to illustrate the problem.
A = np.array([[1,2], [1,0], [5,3]])
f_of_A = f(A) # this is precomputed and expensive
values = np.array([[1,2], [1,0]])
# location of values in A
# if I just had 1d values I could use numpy.in1d here
indices = array([0, 1])
# example of operation type I need (recalculating f_of_A as needed is not an option)
f_of_A[ indices ]
So, basically I think I need some equivalent to in1d for higher dimensions. Does such a thing exist? Or is there some other approach?
Looks like there is also a searchsorted() function, but that seems to work only for 1d arrays. In this example I used 2d points, but any solution would need to work for 3d points as well.
Okay, this is what I came up with.
To find the value of one multi-dimensional index, let's say ii = np.array([1,2]), we can do:
np.where((A == ii).all(axis=1))[0]
Let's break this down, we have A == ii, which will give element-wise comparisons with ii for each row of A. We want an entire row to be true, so we add .all(axis=1) to collapse them. To find where these indices happen, we plug this into np.where and get the first value of the tuple.
Now, I don't have a fast way to do this with multiple indices yet (although I have a feeling there is one). However, this will get the job done:
np.hstack([np.where((A == values[i]).all(axis=1))[0] for i in range(len(values))])
This basically just calls the above, for each value of values, and concatenates the result.
Update:
Here is for the multi-dimensional case (all in one go, should be fairly fast):
np.where((np.expand_dims(A, -1) == values.T).all(axis=1).any(axis=1))[0]
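With the arrays from this question, the one-liner can be sketched as follows. expand_dims gives A shape (3, 2, 1), which broadcasts against values.T of shape (2, 2) to a (3, 2, 2) comparison; all over axis 1 requires every coordinate to match, and any over the remaining axis accepts a match with any row of values:

```python
import numpy as np

A = np.array([[1, 2], [1, 0], [5, 3]])
values = np.array([[1, 2], [1, 0]])

# rows of A that appear anywhere in values
indices = np.where((np.expand_dims(A, -1) == values.T).all(axis=1).any(axis=1))[0]
print(indices)  # → [0 1]
```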
You can use np.in1d over a view of your original array with all coordinates collapsed into a single variable of dtype np.void:
import numpy as np
A = np.array([[1,2], [1,0], [5,3]])
values = np.array([[1,2], [1,0]])
# Make sure both arrays are contiguous and have a common dtype
common_dtype = np.common_type(A, values)
a = np.ascontiguousarray(A, dtype=common_dtype)
vals = np.ascontiguousarray(values, dtype=common_dtype)
a_view = a.view((np.void, a.dtype.itemsize*a.shape[1])).ravel()
values_view = vals.view((np.void,
                         vals.dtype.itemsize*vals.shape[1])).ravel()
Now each item of a_view and values_view is all coordinates for one point packed together, so you can do whatever 1D magic you would use. I don't see how to use np.in1d to find indices though, so I would go the np.searchsorted route:
sort_idx = np.argsort(a_view)
locations = np.searchsorted(a_view, values_view, sorter=sort_idx)
locations = sort_idx[locations]
>>> locations
array([0, 1], dtype=int64)

how do I remove columns from a collection of ndarrays that correspond to zero elements of one of the arrays?

I have some data in 3 arrays with shapes:
docLengths.shape = (10000,)
docIds.shape = (10000,)
docCounts.shape = (68,10000)
I want to obtain relative counts and their means and standard deviations for some i:
docRelCounts = docCounts/docLengths
relCountMeans = docRelCounts[i,:].mean()
relCountDeviations = docRelCounts[i,:].std()
Problem is, some elements of docLengths are zero. This produces NaN elements in docRelCounts and the means and deviations are thus also NaN.
I need to remove the data for documents of zero length. I could write a loop, locating zero length doc's and removing them, but I was hoping for some numpy array magic that would do this more efficiently. Any ideas?
Try this:
docRelCounts = docCounts/docLengths
goodDocRelCounts = docRelCounts[i,:][np.invert(np.isnan(docRelCounts[i,:]))]
relCountMeans = goodDocRelCounts.mean()
relCountDeviations = goodDocRelCounts.std()
np.isnan returns an array of the same shape with True where original array is NaN, False elsewhere. And np.invert inverts this and then you get goodDocRelCounts with only the values that are not NaN.
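As a minimal runnable sketch of that masking step (the array values here are made up for illustration):

```python
import numpy as np

row = np.array([1.0, np.nan, 3.0, np.nan, 5.0])
# keep only the non-NaN entries; ~ is shorthand for np.invert on boolean arrays
good = row[np.invert(np.isnan(row))]
print(good.mean())  # mean of [1, 3, 5] → 3.0
```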
Use nanmean and nanstd, which compute the mean and standard deviation while ignoring NaNs:
from scipy.stats import nanmean, nanstd
(Note: these have since been removed from scipy.stats; modern numpy provides np.nanmean and np.nanstd directly.)
In the end I did this (I'd actually worked it out before I saw eumiro's answer - it's a bit simpler, but otherwise not any better, just different, so I thought I'd include it :)
goodData = docLengths != 0  # mask of documents with nonzero length
docLengths = docLengths[goodData]
docCounts = docCounts[:, goodData]
docRelCounts = docCounts/docLengths
means = map(lambda x: x.mean(), docRelCounts)
stds = map(lambda x: x.std(), docRelCounts)