Count number of "True" values in boolean Tensor - python

I understand that tf.where will return the locations of True values, so that I could use the result's shape[0] to get the number of Trues.
However, when I try to use this, the dimension is unknown (which makes sense, as it needs to be computed at runtime). So my question is: how can I access a dimension and use it in an operation like a sum?
For example:
myOtherTensor = tf.constant([[True, True], [False, True]])
myTensor = tf.where(myOtherTensor)
myTensor.get_shape() #=> [None, 2]
sum = 0
sum += myTensor.get_shape().as_list()[0] # Well defined at runtime but considered None until then.

You can cast the values to floats and compute the sum on them:
tf.reduce_sum(tf.cast(myOtherTensor, tf.float32))
Depending on your actual use case you can also compute sums per row/column if you specify the reduction dimensions of the call, as sketched below.
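For instance, a minimal sketch of a per-row count via the axis argument (the tensor values here are made up for illustration):
rows = tf.constant([[True, False, True], [False, False, True]])
per_row = tf.reduce_sum(tf.cast(rows, tf.float32), axis=1)  # -> [2., 1.]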

I think this is the easiest way to do it:
In [38]: myOtherTensor = tf.constant([[True, True], [False, True]])
In [39]: if_true = tf.count_nonzero(myOtherTensor)
In [40]: sess.run(if_true)
Out[40]: 3
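(Note: this snippet assumes an interactive setup where sess is an existing tf.Session, e.g. sess = tf.Session().)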

Rafal's answer is almost certainly the simplest way to count the number of true elements in your tensor, but the other part of your question asked:
[H]ow can I access a dimension and use it in an operation like a sum?
To do this, you can use TensorFlow's shape-related operations, which act on the runtime value of the tensor. For example, tf.size(t) produces a scalar Tensor containing the number of elements in t, and tf.shape(t) produces a 1D Tensor containing the size of t in each dimension.
Using these operators, your program could also be written as:
myOtherTensor = tf.constant([[True, True], [False, True]])
myTensor = tf.where(myOtherTensor)
countTrue = tf.shape(myTensor)[0] # Size of `myTensor` in the 0th dimension.
sess = tf.Session()
sum = sess.run(countTrue)

There is a TensorFlow function to count non-zero values, tf.count_nonzero. The function also accepts axis and keep_dims arguments.
Here is a simple example:
import numpy as np
import tensorflow as tf
a = tf.constant(np.random.random(100))
with tf.Session() as sess:
    print(sess.run(tf.count_nonzero(tf.greater(a, 0.5))))
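And a minimal sketch of the axis argument mentioned above, counting True values per row (the sample tensor is made up):
b = tf.constant([[True, False, True], [False, False, True]])
with tf.Session() as sess:
    print(sess.run(tf.count_nonzero(b, axis=1)))  # [2 1]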

Related

Build a new array from an existing using a boolean mask

I have created a boolean mask, say mask, which I want to apply to an existing array, say old, to create an entirely new one, say new, which retains only the elements where the mask is non-zero. The new array will then be smaller than old.
Can someone suggest the fastest and most concise way to do this, without using the numpy.append function if possible?
Say you have:
old = np.array([2,4,3,5,6])
mask = [True, False, True, False, False]
Simply do:
new = old[mask]
print(new)
[2 3]
I suggest you read about Boolean or “mask” index arrays in the NumPy documentation.
Just use logical indexing:
x = x[x != 0]
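For instance, a minimal sketch combining the two ideas, building the mask from a condition on the array itself (the values are made up):
import numpy as np
old = np.array([2, 0, 3, 0, 6])
mask = old != 0   # a boolean "mask" index array
new = old[mask]   # keeps only the elements where the mask is True
print(new)        # [2 3 6]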

How do I shift my thinking to 'vectorize my computation' more than using 'for-loops'?

This is definitely more of a notional question, but I wanted to get others' expert input on this topic here at SO. Most of my programming lately has involved NumPy arrays. I've been matching items in two or more arrays that differ in size. Most of the time I will go to a for-loop, or even worse, a nested for-loop. I'm ultimately trying to avoid for-loops as I gain more experience in data science, because for-loops perform slowly.
I am well aware of NumPy and the predefined commands I can research, but for those of you who are experienced: do you have a general school of thought when you iterate through something?
Something similar to the following:
small_array = np.array(["a", "b"])
big_array = np.array(["a", "b", "c", "d"])
for i in range(len(small_array)):
    for p in range(len(big_array)):
        if small_array[i] == big_array[p]:
            print("This item is matched: ", small_array[i])
I'm well aware there are more than one way to skin a cat with this, but I am interested in others approach and way of thinking.
Since I've been working with array languages for decades (APL, MATLAB, numpy), I can't help much with the starting steps. But I suspect I work mostly from patterns, things I've seen and used in the past. And I do a lot of experimentation in an interactive session.
To take your example:
In [273]: small_array = np.array(["a", "b"])
     ...: big_array = np.array(["a", "b", "c", "d"])
     ...:
     ...: for i in range(len(small_array)):
     ...:     for p in range(len(big_array)):
     ...:         if small_array[i] == big_array[p]:
     ...:             print("This item is matched: ", small_array[i])
     ...:
This item is matched: a
This item is matched: b
Often I run the iterative case just to get a clear(er) idea of what is desired.
In [274]: small_array
Out[274]:
array(['a', 'b'],
dtype='<U1')
In [275]: big_array
Out[275]:
array(['a', 'b', 'c', 'd'],
dtype='<U1')
I've seen this before - iterating over two arrays, and doing something with the paired values. This is a kind of outer operation. There are various tools, but the one I like best uses numpy broadcasting. It turns one array into an (n,1) array and uses it with the other (m,) array:
In [276]: small_array[:,None]
Out[276]:
array([['a'],
['b']],
dtype='<U1')
The result of an (n,1) array operating with an (m,) array (which broadcasts as (1,m)) is an (n,m) array:
In [277]: small_array[:,None]==big_array
Out[277]:
array([[ True, False, False, False],
[False, True, False, False]], dtype=bool)
Now I can take an any or all reduction on either axis:
In [278]: _.all(axis=0)
Out[278]: array([False, False, False, False], dtype=bool)
In [280]: __.all(axis=1)
Out[280]: array([False, False], dtype=bool)
I could also use np.where to reduce that boolean to indices.
Oops, I should have used any:
In [284]: (small_array[:,None]==big_array).any(0)
Out[284]: array([ True, True, False, False], dtype=bool)
In [285]: (small_array[:,None]==big_array).any(1)
Out[285]: array([ True, True], dtype=bool)
Having played with this, I remember that there's an in1d that does something similar:
In [286]: np.in1d(big_array, small_array)
Out[286]: array([ True, True, False, False], dtype=bool)
But when I look at the code for in1d (see the [source] link in the docs), I see that, in some cases, it actually iterates on the small array:
In [288]: for x in small_array:
     ...:     print(x == big_array)
     ...:
[ True False False False]
[False True False False]
Compare that to Out[277]. x==big_array compares a scalar with an array. In numpy, doing something like ==, +, * etc with an array and scalar is easy, and should become second nature. Doing the same thing with 2 arrays of matching shapes is the next step. And from there do it with broadcastable shapes.
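A minimal sketch of that progression, from scalar to matching shapes to broadcasting (the values are made up):
import numpy as np
a = np.array([1, 2, 3])
b = np.array([10, 20, 30])
print(a + 1)           # array with scalar:  [2 3 4]
print(a + b)           # matching shapes:    [11 22 33]
print(a[:, None] + b)  # (3,1) with (3,) broadcasts to a (3,3) result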
In other cases the in1d code uses np.unique and np.argsort.
This pattern of creating a higher dimension array by broadcasting the inputs against each other, and then combining values with some sort of reduction (any, all, sum, mean, etc) is very common.
I will interpret your question in a more specific way:
How do I quit using index variables?
How do I start writing list comprehensions instead of normal loops?
To quit using index variables, the key is to understand that "for" in Python is not the "for" of other languages. It should be called "for each":
for x in small_array:
    for y in big_array:
        if x == y:
            print("This item is matched: ", x)
That's much better.
I also find myself in situations where I would write code with normal loops (or actually do it) and then start wondering whether it would be clearer and more elegant with a list comprehension.
List comprehensions are really a domain-specific language to create lists, so the first step would be to learn its basics. A typical statement would be:
l = [f(x) for x in list_expression if g(x)]
Meaning "give me a list of f(x), for all x out of list_expression that meet condition g"
So you could write it in this way:
matched = [x for x in small_array if x in big_array]
Et voilà, you are on the road to pythonic style!
As you said, you'd better use vectorized operations to speed things up, but learning them is a long path. You have to get used to matrix multiplication if you aren't already. Once you are, try to translate your data into matrices and see which multiplications you can do. Often you can't do everything you want that way and need higher-dimensional arrays (more than 2 dimensions). That's where numpy gets useful.
Numpy provides functions like np.where; know how to use them. Know shortcuts like small_array[small_array == 'a'] = 'z'. Try to combine numpy functions with native Python ones (map, filter, ...).
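For instance, a minimal sketch of those two idioms (the array contents are made up for illustration):
import numpy as np
arr = np.array(["a", "b", "a", "c"])
idx = np.where(arr == "a")[0]  # indices of the matches: [0 2]
arr[arr == "a"] = "z"          # the masked-assignment shortcut
print(idx, arr)                # [0 2] ['z' 'b' 'z' 'c']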
To handle multi-dimensional arrays there's no secret: practice, and use paper to understand what you're doing. But above 4 dimensions it starts getting very tricky.
For-loops are not necessarily slow. That's a MATLAB myth spread over time because of MATLAB's own faults. Vectorization is "for" looping, but at a lower level. You need to get a handle on what kind of data and architecture you are working with and which kind of function you are executing over your data.

Replacing NumPy array entries with their frequencies / values from dictionary

Problem: From two input arrays, I want to output an array with the frequency of True values (from input_2) corresponding to each value of input_1.
import numpy as np
from scipy.stats import itemfreq
input_1 = np.array([3,6,6,3,6,4])
input_2 = np.array([False, True, True, False, False, True])
For this example, the output that I want is:
output_1 = np.array([0,2,2,0,2,1])
My current approach involves editing input_1, so only the values corresponding to True remain:
locs = np.where(input_2 == True, input_1, 0)
Then I count the frequency of each value, create a dictionary, and replace the appropriate keys of input_1 with their values (the True frequencies).
loc_freq = itemfreq(locs)
dic = {}
for key, val in loc_freq:
    dic[key] = val
print(dic)
for k, v in dic.items():
    input_1[input_1 == k] = v
which outputs [3,2,2,3,2,1].
The problem here is twofold:
1) This still does not do anything with the values that are not keys in the dictionary (and should therefore be changed to 0). For example, how can I get the 3s transformed into 0s?
2) This seems very inelegant / inefficient. Is there a better way to approach this?
np.bincount is what you are looking for:
output_1 = np.bincount(input_1[input_2])[input_1]
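For the sample inputs above, a quick check that this reproduces the desired output:
import numpy as np
input_1 = np.array([3, 6, 6, 3, 6, 4])
input_2 = np.array([False, True, True, False, False, True])
print(np.bincount(input_1[input_2])[input_1])  # [0 2 2 0 2 1]
It works because input_1[input_2] keeps only the values flagged True, np.bincount counts how often each integer occurs, and indexing that count array with input_1 looks up the count for every original element.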
@memecs' solution is correct, +1. However, it will be very slow and take a lot of memory if the values in input_1 are really large, i.e. if they are not indices of an array but, say, seconds or some other integer data that can take very large values.
In that case, np.bincount(input_1[input_2]).size is equal to the largest integer in input_1 with a True value in input_2, plus one.
It is much faster to use unique and bincount. We use the first to extract the indices of the unique elements of input_1, and then use bincount to count how often these indices appear in that same array, weighting them 1 or 0 based on the value of the array input_2 (True or False):
# extract unique elements and the indices to reconstruct the array
unq, idx = np.unique(input_1, return_inverse=True)
# calculate the weighted frequencies of these indices
freqs_idx = np.bincount(idx, weights=input_2)
# reconstruct the array of frequencies of the elements
frequencies = freqs_idx[idx]
print(frequencies)
This solution is really fast and has minimal memory impact. Credit goes to @Jaime; see his comment below. Below I report my original answer, using unique in a different manner.
OTHER POSSIBILITY
It may be faster to go for another solution, using unique:
import numpy as np
input_1 = np.array([3, 6, 6, 3, 6, 4])
input_2 = np.array([False, True, True, False, False, True])
non_zero_hits, counts = np.unique(input_1[input_2], return_counts=True)
all_hits, idx = np.unique(input_1, return_inverse=True)
frequencies = np.zeros_like(all_hits)
#2nd step, with broadcasting
idx_non_zero_hits_in_all_hits = np.where(non_zero_hits[:, np.newaxis] - all_hits == 0)[1]
frequencies[idx_non_zero_hits_in_all_hits] = counts
print(frequencies[idx])
This has the drawback that it will require a lot of memory if there are many unique elements in input_1 with a True value in input_2, because of the 2D array created and passed to where. To reduce the memory footprint, you can use a for loop instead for the 2nd step of the algorithm:
# 2nd step, but with a for loop
for j, val in enumerate(non_zero_hits):
    index = np.where(val == all_hits)[0]
    frequencies[index] = counts[j]
print(frequencies[idx])
This second solution has a very small memory footprint but requires a for loop. Which solution is best depends on your typical input sizes and values.
The currently accepted bincount solution is quite elegant, but the numpy_indexed package provides more general solutions to problems of this kind:
import numpy_indexed as npi
idx = npi.as_index(input_1)
unique_labels, true_count_per_label = npi.group_by(idx).sum(input_2)
print(true_count_per_label[idx.inverse])

Numpy/Python: Array iteration without for-loop

So it's another n-dimensional array question:
I want to be able to compare each value in an n-dimensional array with its neighbours. For example, if a is a 2-dimensional array, I want to be able to check:
a[y][x] == a[y+1][x]
for all elements. So basically, check all neighbours in all dimensions. Right now I'm doing it via:
for x in range(1, a.shape[0]-1):
    do.something(a[x])
The shape of the array is used so that I don't run into an index-out-of-range error at the edges. So if I want to do something like this in n-D for all elements in the array, I need n for-loops, which seems untidy. Is there a way to do this via slicing? Something like a == a[:,-1,:], or am I understanding this completely wrong? And is there a way to tell a slice to stop at the end? Or is there another idea for getting this to work in a totally different way? Masked arrays?
Greets Joni
Something like:
a = np.array([1,2,3,4,4,5])
a == np.roll(a,1)
which returns
array([False, False, False, False,  True, False], dtype=bool)
You can specify an axis too for higher dimensions, though as others have said you'll need to handle the edges somehow, since the values wrap around (as you can guess from the name).
For a fuller example in 2D:
# generate 2d data
a = np.array((np.random.rand(5,5)) * 10, dtype=np.uint8)
# check all neighbours
for ax in range(len(a.shape)):
    for i in [-1, 1]:
        print(a == np.roll(a, i, axis=ax))
This might also be useful, this will compare each element to the following element, along axis=1. You can obviously adjust the axis or the distance. The trick is to make sure that both sides of the == operator have the same shape.
a[:, :-1, :] == a[:, 1:, :]
How about just:
np.diff(a) != 0
?
If you need the neighbours along the other axis, maybe diff the result of np.swapaxes(a, 0, 1) and merge the results together somehow? (np.diff also takes an axis argument directly, as sketched below.)
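A minimal sketch of that idea in 2D (the sample array is made up):
import numpy as np
a = np.array([[1, 1, 2],
              [1, 3, 3]])
same_as_right = np.diff(a, axis=1) == 0  # element equals its right-hand neighbour
same_as_below = np.diff(a, axis=0) == 0  # element equals the neighbour below
print(same_as_right)  # [[ True False] [False  True]]
print(same_as_below)  # [[ True False False]]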

How to find slice object in numpy array

I have a numpy array containing integers and slice objects, e.g.:
x = np.array([0,slice(None)])
How do I retrieve the (logical) indices of the integers or slice objects? I tried np.isfinite(x) (producing an error), np.isreal(x) (all True), np.isscalar(x) (not element-wise), all in vain.
What seems to work though is
ind = x<np.Inf # Out[1]: array([True, False], dtype=bool)
but I'm reluctant to use a numerical comparison on an object whose numerical value is completely arbitrary (and might change in the future?). Is there a better solution to achieve this?
You can do this:
import numpy as np
checker = np.vectorize(lambda x: isinstance(x, slice))
x = np.array([0,slice(None),slice(None),0,0,slice(None)])
checker(x)
#array([False, True, True, False, False, True], dtype=bool)
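Since np.vectorize is essentially a convenience wrapper around a Python-level loop, an equivalent list comprehension works just as well (a minimal sketch):
np.array([isinstance(item, slice) for item in x])
#array([False, True, True, False, False, True], dtype=bool)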
