Count the number of occurrences of each one-hot code - python

I have a list of NumPy arrays (one-hot representation) like the example below. I want to count the number of occurrences of each one-hot code.
[0 0 1 0 0 0 0 0 0 0]
[0 0 1 0 0 0 0 0 0 0]
[0 1 0 0 0 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0]
[0 1 0 0 0 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1]
[0 0 0 0 1 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0]
[0 0 0 1 0 0 0 0 0 0]
[0 1 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1]
Edit :
Expected output :
[1 0 0 0 0 0 0 0 0 0] ==> 1 occurrence
[0 0 1 0 0 0 0 0 0 0] ==> 2 occurrences
[0 1 0 0 0 0 0 0 0 0] ==> 3 occurrences
[0 0 0 0 0 1 0 0 0 0] ==> 1 occurrence
[0 0 0 0 1 0 0 0 0 0] ==> 2 occurrences
[0 0 0 0 0 0 0 0 0 1] ==> 2 occurrences

I think you can get the result you seek:
[1 3 2 1 2 1 0 0 0 2]
indicating the count of occurrences of one hot in that position via a simple column-wise sum using ndarray.sum():
import numpy
data = numpy.array([
[0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
])
print(numpy.ndarray.sum(data, axis=0))
or more compactly as just:
print(data.sum(axis=0))
both should give you:
[1 3 2 1 2 1 0 0 0 2]
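If you would rather get the counts keyed by the one-hot rows themselves, as in the expected output of the question, a minimal sketch could use np.unique with axis=0. This is an alternative to the column sum, not part of the answer above; the names codes and counts are just illustrative.

```python
import numpy as np

data = np.array([
    [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
])

# axis=0 treats every row as one item; return_counts reports how often each row occurs.
codes, counts = np.unique(data, axis=0, return_counts=True)

for code, n in zip(codes, counts):
    print(code, "==>", n, "occurrence" if n == 1 else "occurrences")
```

Note that np.unique returns the distinct rows in sorted order, not in order of first appearance, and that codes which never occur are simply absent.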

Using the fact that each row is one-hot, you can do the following:
import numpy as np
temp = np.array([[0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1]])
converting the one-hot to indices can be done as follows:
temp2 = np.argmax(temp, axis=1) # array([2, 2, 1, 5, 1, 4, 9, 4, 0, 3, 1, 9])
and then the occurrences can be counted using np.histogram. We know that you have 10 possible values, so we use 10 bins as follows:
temp3 = np.histogram(temp2, bins=10, range=(-0.5,9.5))
np.histogram returns a tuple where index [0] holds the histogram values and index [1] holds the bin edges. In your case:
(array([1, 3, 2, 1, 2, 1, 0, 0, 0, 2]),
array([-0.5, 0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5]))
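Since the indices are small non-negative integers, np.bincount is probably a simpler fit than a histogram with hand-picked bin edges. A minimal sketch, assuming the same 10 possible codes (the variable names are illustrative):

```python
import numpy as np

# Indices obtained from np.argmax(temp, axis=1) in the answer above.
indices = np.array([2, 2, 1, 5, 1, 4, 9, 4, 0, 3, 1, 9])

# minlength=10 keeps a slot for codes that never occur.
counts = np.bincount(indices, minlength=10)
print(counts)  # [1 3 2 1 2 1 0 0 0 2]
```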

Related

How to join matrices like puzzle pieces in python

I've got three puzzle pieces defined as a number of arrays, 7x7, in a following manner:
R3LRU = pd.DataFrame([
[1, 1, 1, 1, 1, 1, 1],
[1, 0, 0, 0, 0, 0, 1],
[1, 0, 0, 0, 0, 0, 1],
[1, 0, 0, 0, 0, 0, 1],
[1, 0, 0, 0, 0, 0, 1],
[1, 0, 0, 0, 0, 0, 1],
[1, 0, 0, 0, 0, 0, 1]
])
I am trying to join them by the following rules: 1111111 can be joined with 1000001, 1000001 can be joined with 1000001, but 1111111 cannot be joined with 1111111. Better illustration will be the following:
I have tried using pd.concat function, but it just glues them together instead of joining by sides, like this:
Or, in terms of code output, like this:
0 1 2 3 4 5 6 0 1 2 3 4 5 6 0 1 2 3 4 5 6
0 1 1 1 1 1 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1
1 1 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0
2 1 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0
3 1 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0
4 1 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0
5 1 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0
6 1 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
I suppose I would like to join by columns 6 and 0, or rows 6 and 0
How can I define "joining" sides, so that the pieces would join through the proposed rules?
I take it you want to concatenate two pieces if the last column of one matches the first column of the other, and then "overlap" both parts. I don't think pandas is a good fit for this problem, as you only need the values, not column labels or any of the other features you would use pandas for.
I would recommend simple numpy arrays. Then you could do something like
In [1]: import numpy as np
In [2]: R3LRU = np.array([
...: [1, 1, 1, 1, 1, 1, 1],
...: [1, 0, 0, 0, 0, 0, 1],
...: [1, 0, 0, 0, 0, 0, 1],
...: [1, 0, 0, 0, 0, 0, 1],
...: [1, 0, 0, 0, 0, 0, 1],
...: [1, 0, 0, 0, 0, 0, 1],
...: [1, 0, 0, 0, 0, 0, 1]
...: ])
In [3]: R3LRU
Out[3]:
array([[1, 1, 1, 1, 1, 1, 1],
[1, 0, 0, 0, 0, 0, 1],
[1, 0, 0, 0, 0, 0, 1],
[1, 0, 0, 0, 0, 0, 1],
[1, 0, 0, 0, 0, 0, 1],
[1, 0, 0, 0, 0, 0, 1],
[1, 0, 0, 0, 0, 0, 1]])
Get the last column of the first part and the first column of the second part
In [4]: R3LRU[:,0]
Out[4]: array([1, 1, 1, 1, 1, 1, 1])
In [5]: R3LRU[:,-1]
Out[5]: array([1, 1, 1, 1, 1, 1, 1])
Compare them
In [6]: R3LRU[:,0] == R3LRU[:,-1]
Out[6]: array([ True, True, True, True, True, True, True])
In [7]: np.all(R3LRU[:,0] == R3LRU[:,-1])
Out[7]: True
If they are equal, combine them
In [8]: if np.all(R3LRU[:,0] == R3LRU[:,-1]):
...:     combined = np.hstack([R3LRU[:,:-1], R3LRU])
In [9]: combined
Out[9]:
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1],
[1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1],
[1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1],
[1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1],
[1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1],
[1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1]])
Maybe your rules are a bit more complicated than a simple == comparison, but you can just make that if statement more complicated to reflect all rules you have ;)
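The steps above could be wrapped in a small helper along these lines. This is only a sketch: join_if_edges_match is a hypothetical name, and it implements nothing more than the plain equality check, so any stricter joining rule would have to replace the comparison.

```python
import numpy as np

def join_if_edges_match(left, right):
    """Hypothetical helper: overlap `right` onto `left` when the touching edges are equal."""
    if not np.array_equal(left[:, -1], right[:, 0]):
        return None  # edges differ, so the pieces do not fit
    # Drop the duplicated edge column from `left`, then put `right` next to it.
    return np.hstack([left[:, :-1], right])

piece = np.array([
    [1, 1, 1, 1, 1, 1, 1],
    [1, 0, 0, 0, 0, 0, 1],
    [1, 0, 0, 0, 0, 0, 1],
    [1, 0, 0, 0, 0, 0, 1],
    [1, 0, 0, 0, 0, 0, 1],
    [1, 0, 0, 0, 0, 0, 1],
    [1, 0, 0, 0, 0, 0, 1],
])

combined = join_if_edges_match(piece, piece)
print(combined.shape)  # (7, 13)
```

Joining along rows instead of columns works the same way with left[-1, :], right[0, :] and np.vstack.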

How to one-hot-encode sentences at the character level?

I would like to convert a sentence to an array of one-hot vectors.
These vectors would be the one-hot representation of the alphabet.
It would look like the following:
"hello" # h=7, e=4 l=11 o=14
would become
[[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
Unfortunately, OneHotEncoder from sklearn does not accept strings as input.
Just compare the letters in your passed string to a given alphabet:
import string
def string_vectorizer(strng, alphabet=string.ascii_lowercase):
    # One row per letter of the input, one column per character of the alphabet
    vector = [[0 if char != letter else 1 for char in alphabet]
              for letter in strng]
    return vector
Note that with a custom alphabet (e.g. "defbcazk"), the columns will be ordered as each element appears in that alphabet.
The output of string_vectorizer('hello'):
[[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
This is a common task in Recurrent Neural Networks and there's a specific function just for this purpose in tensorflow, if you'd like to use it.
alphabets = {'a' : 0, 'b': 1, 'c':2, 'd':3, 'e':4, 'f':5, 'g':6, 'h':7, 'i':8, 'j':9, 'k':10, 'l':11, 'm':12, 'n':13, 'o':14}
idxs = [alphabets[ch] for ch in 'hello']
print(idxs)
# [7, 4, 11, 11, 14]
# #divakar's approach
idxs = np.frombuffer(b"hello", dtype=np.uint8) - 97  # np.fromstring's binary mode is deprecated
# or, for a clearer understanding, use:
idxs = np.frombuffer(b"hello", dtype=np.uint8) - ord('a')
one_hot = tf.one_hot(idxs, 26, dtype=tf.uint8)
sess = tf.InteractiveSession()
In [15]: one_hot.eval()
Out[15]:
array([[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=uint8)
With pandas, you can use pd.get_dummies by passing a categorical Series:
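import pandas as pd
import string
low = string.ascii_lowercase
s = 'hello'
# Fix the categories to the full alphabet so that all 26 columns are produced
pd.get_dummies(pd.Series(list(s)).astype(pd.CategoricalDtype(categories=list(low))), dtype=int)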
Out:
a b c d e f g h i j ... q r s t u v w x y z
0 0 0 0 0 0 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
[5 rows x 26 columns]
Here's a vectorized approach using NumPy broadcasting to give us a (N,26) shaped array -
ints = np.frombuffer(b"hello", dtype=np.uint8) - 97
out = (ints[:,None] == np.arange(26)).astype(int)
If you are looking for performance, I would suggest using an initialized array and then assign -
out = np.zeros((len(ints),26),dtype=int)
out[np.arange(len(ints)), ints] = 1
Sample run -
In [153]: ints = np.frombuffer(b"hello", dtype=np.uint8) - 97
In [154]: ints
Out[154]: array([ 7, 4, 11, 11, 14], dtype=uint8)
In [155]: out = (ints[:,None] == np.arange(26)).astype(int)
In [156]: print(out)
[[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]]
You asked about "sentences" but your example provided only a single word, so I'm not sure what you wanted to do about spaces. But as far as single words are concerned, your example could be implemented with:
def onehot(ltr):
    return [1 if i==ord(ltr) else 0 for i in range(97,123)]
def onehotvec(s):
    return [onehot(c) for c in list(s.lower())]
onehotvec("hello")
[[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
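For actual sentences, one possible sketch is to add the space character to the alphabet and fill a zero array by index assignment. The function name one_hot_sentence, the 27-character alphabet, and the choice to silently skip characters outside it are all assumptions made for illustration.

```python
import string
import numpy as np

def one_hot_sentence(text, alphabet=string.ascii_lowercase + " "):
    """Hypothetical helper: one row per character, one column per alphabet entry."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    # Assumption: characters outside the alphabet (digits, punctuation) are skipped.
    cols = np.array([index[ch] for ch in text.lower() if ch in index], dtype=int)
    out = np.zeros((len(cols), len(alphabet)), dtype=int)
    out[np.arange(len(cols)), cols] = 1
    return out

encoded = one_hot_sentence("hello world")
print(encoded.shape)           # (11, 27)
print(encoded.argmax(axis=1))  # [ 7  4 11 11 14 26 22 14 17 11  3]
```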

Converting an array of numpy arrays to DataFrame

I have a numpy object that contains the following:
17506 [0, 0, 0, 0, 0, 0]
17507 [0, 0, 0, 0, 0, 0]
17508 [0, 0, 0, 0, 0, 0]
17509 [0, 0, 0, 0, 0, 0]
17510 [0, 0, 0, 0, 0, 0]
17511 [0, 0, 0, 0, 0, 0]
17512 [0, 0, 0, 0, 0, 0]
17513 [0, 0, 0, 0, 0, 0]
17514 [0, 0, 0, 0, 0, 0]
17515 [0, 0, 0, 0, 0, 0]
17516 [0, 0, 0, 0, 0, 0]
17517 [0, 0, 0, 0, 0, 0]
17518 [0, 0, 0, 0, 0, 0]
17519 [0, 0, 0, 0, 0, 0]
(An array that contains arrays of dtype('int32'))
How can I efficiently convert this to a DataFrame in pandas and concatenate it (vertically) to an existing DataFrame?
What seems to be the problem? You may need to further describe your data.
a = np.array([np.zeros(6) for _ in range(3)])
>>> pd.DataFrame(a)
0 1 2 3 4 5
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
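If the object really is a 1-D array whose elements are themselves arrays (dtype object), a minimal sketch would be to stack the rows first and then concatenate. The sample data and the names obj, existing and combined below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical input: a 1-D object array holding three int32 arrays of length 6.
obj = np.empty(3, dtype=object)
for i in range(3):
    obj[i] = np.array([i, 0, 0, 0, 0, 0], dtype=np.int32)

# Stack the inner arrays into a regular 2-D array, then wrap it in a DataFrame.
new_rows = pd.DataFrame(np.vstack(list(obj)))

# Hypothetical existing frame with the same six columns.
existing = pd.DataFrame(np.ones((2, 6), dtype=np.int32))

# Vertical concatenation; ignore_index renumbers the rows 0..n-1.
combined = pd.concat([existing, new_rows], ignore_index=True)
print(combined.shape)  # (5, 6)
```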

How to find cluster sizes in 2D numpy array?

My problem is the following,
I have a 2D numpy array filled with 0 and 1, with an absorbing boundary condition (all the outer elements are 0), for example:
[[0 0 0 0 0 0 0 0 0 0]
[0 0 1 0 0 0 0 0 0 0]
[0 0 1 0 1 0 0 0 1 0]
[0 0 0 0 0 0 1 0 1 0]
[0 0 0 0 0 0 1 0 0 0]
[0 0 0 0 1 0 1 0 0 0]
[0 0 0 0 0 1 1 0 0 0]
[0 0 0 1 0 1 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0]]
I want to create a function that takes this array and its linear dimension L as input parameters, (in this case L = 10) and returns the list of cluster sizes of this array.
By "clusters" I mean the isolated groups of elements 1 of the array
the array element [ i ][ j ] is isolated if all its neighbours are zeros, and its neighbours are the elements:
[i+1][j]
[i-1][j]
[i][j+1]
[i][j-1]
So in the previous array we have 7 clusters of sizes (2,1,2,6,1,1,1)
I tried to complete this task by creating two functions, the first one is a recursive function:
def clust_size(array,i,j):
    count = 0
    if array[i][j] == 1:
        array[i][j] = 0
        if array[i-1][j] == 1:
            count += 1
            array[i-1][j] = 0
            clust_size(array,i-1,j)
        elif array[i][j-1] == 1:
            count += 1
            array[i-1][j] = 0
            clust_size(array,i,j-1)
        elif array[i+1][j] == 1:
            count += 1
            array[i-1][j] = 0
            clust_size(array,i+1,j)
        elif array[i][j+1] == 1:
            count += 1
            array[i-1][j] = 0
            clust_size(array,i,j+1)
    return count+1
and it should return the size of one cluster. Every time the function finds an array element equal to 1, it increases the value of the counter "count" and changes the value of the element to 0, so that each '1' element is counted just once.
If one of the neighbours of the element is equal to 1 then the function calls itself on that element.
The second function is:
def clust_list(array,L):
    sizes_list = []
    for i in range(1,L-1):
        for i in range(1,L-1):
            count = clust_size(array,i,j)
            sizes_list.append(count)
    return sizes_list
and it should return the list containing the cluster sizes. The for loop iterates from 1 to L-1 because all the outer elements are 0.
This doesn't work and I can't see where the error is...
I was wondering if maybe there's an easier way to do it.
it seems like a percolation problem.
The following link has your answer if you have scipy installed.
http://dragly.org/2013/03/25/working-with-percolation-clusters-in-python/
from pylab import *
from scipy.ndimage import measurements
z2 = array([[0,0,0,0,0,0,0,0,0,0],
[0,0,1,0,0,0,0,0,0,0],
[0,0,1,0,1,0,0,0,1,0],
[0,0,0,0,0,0,1,0,1,0],
[0,0,0,0,0,0,1,0,0,0],
[0,0,0,0,1,0,1,0,0,0],
[0,0,0,0,0,1,1,0,0,0],
[0,0,0,1,0,1,0,0,0,0],
[0,0,0,0,1,0,0,0,0,0],
[0,0,0,0,0,0,0,0,0,0]])
This will identify the clusters:
lw, num = measurements.label(z2)
print(lw)
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 1, 0, 2, 0, 0, 0, 3, 0],
[0, 0, 0, 0, 0, 0, 4, 0, 3, 0],
[0, 0, 0, 0, 0, 0, 4, 0, 0, 0],
[0, 0, 0, 0, 5, 0, 4, 0, 0, 0],
[0, 0, 0, 0, 0, 4, 4, 0, 0, 0],
[0, 0, 0, 6, 0, 4, 0, 0, 0, 0],
[0, 0, 0, 0, 7, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
The following will calculate their area.
area = measurements.sum(z2, lw, index=arange(lw.max() + 1))
print(area)
[ 0. 2. 1. 2. 6. 1. 1. 1.]
This gives what you expect, although I would think that you would have a cluster with 8 members by eye-percolation.
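With a recent SciPy, the same labelling can be sketched without pylab by calling scipy.ndimage.label directly and counting label occurrences with np.bincount. The variable names here are illustrative; the default structuring element of label is assumed to give the 4-connectivity described in the question.

```python
import numpy as np
from scipy import ndimage

grid = np.array([
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 1, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
])

# label() numbers each connected group of 1s; num is the number of groups found.
labels, num = ndimage.label(grid)

# bincount counts the cells per label; index 0 is the background, so drop it.
sizes = np.bincount(labels.ravel())[1:]
print(num)    # 7
print(sizes)  # [2 1 2 6 1 1 1]
```

Passing structure=np.ones((3, 3)) to label would also connect diagonal neighbours, if that is what you want.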
I feel your problem with finding "clusters", is essentially the same problem of finding connected components in a binary image (with values of either 0 or 1) based on 4-connectivity. You can see several algorithms to identify the connected components (or "clusters" as you defined them) in this Wikipedia page:
http://en.wikipedia.org/wiki/Connected-component_labeling
Once the connected components or "clusters" are labelled, you can find any information you want easily, including the area, relative position or any other information you may want.
I believe that your approach is almost correct, except that you are initializing the variable count over and over again whenever you recursively call your function clust_size. I would add the count variable to the input parameters of clust_size and just reinitialize it for every first call in your nested for loops with count = 0.
Like this, you would call clust_size always like count=clust_size(array, i ,j, count)
I haven't tested it but it seems to me that it should work.
Hope it helps.
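As a variation on the same recursive idea, here is a minimal sketch that returns the size from each call and adds the results up, instead of threading count through the parameters. It visits all four neighbours (not just the first one that matches) and assumes, as in the question, that every outer element is 0, so the recursion never leaves the array.

```python
import numpy as np

def clust_size(grid, i, j):
    """Size of the 4-connected cluster containing (i, j); visited cells are set to 0."""
    if grid[i][j] != 1:
        return 0
    grid[i][j] = 0  # mark as visited so the cell is counted only once
    size = 1
    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        size += clust_size(grid, i + di, j + dj)
    return size

def clust_list(grid, L):
    work = np.array(grid)  # work on a copy, the caller's array stays intact
    sizes = []
    for i in range(1, L - 1):
        for j in range(1, L - 1):
            size = clust_size(work, i, j)
            if size > 0:
                sizes.append(size)
    return sizes

grid = np.array([
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 1, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
])
print(clust_list(grid, 10))  # [2, 1, 2, 6, 1, 1, 1]
```

For large clusters the recursion depth can exceed Python's default limit, in which case an explicit stack or a labelling routine is the safer choice.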
A relatively simple problem if you convert this to strings
import numpy as np
arr=np.array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0,],
[0, 0, 1, 0, 0, 0, 0, 0, 0, 0,],
[0, 0, 1, 1, 1, 1, 1, 1, 1, 0,], #modified
[0, 0, 0, 0, 0, 0, 1, 0, 1, 0,],
[0, 0, 0, 0, 0, 0, 1, 0, 0, 0,],
[0, 0, 0, 0, 1, 0, 1, 0, 0, 0,],
[0, 0, 0, 0, 0, 1, 1, 0, 0, 0,],
[0, 0, 0, 1, 0, 1, 0, 0, 0, 0,],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0,],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
arr = "".join([str(x) for x in arr.reshape(-1)])
print([len(x) for x in arr.replace("0"," ").split()])
output
[1, 7, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1] #Cluster sizes

Python : How to fill an array line by line?

I have an issue with numpy that I can't solve.
I have 3D arrays (x,y,z) filled with 0 and 1.
For instance, one slice in the z axis :
array([[1, 0, 1, 0, 1, 1, 0, 0],
[0, 0, 1, 1, 0, 1, 1, 0],
[1, 0, 1, 1, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 1, 0, 1, 0, 0, 1],
[1, 0, 0, 0, 0, 1, 0, 1],
[0, 0, 0, 0, 1, 0, 0, 0],
[0, 0, 1, 0, 1, 1, 0, 1]])
And I want this result :
array([[1, 1, 1, 1, 1, 1, 0, 0],
[0, 0, 1, 1, 1, 1, 1, 0],
[1, 1, 1, 1, 1, 1, 1, 1],
[0, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1],
[0, 0, 0, 0, 1, 0, 0, 0],
[0, 0, 1, 1, 1, 1, 1, 1]])
That is to say, what I want to do for each slice z is to scan line by line right to left and left to right (x axis) and the first time I have a 1 I want to fill the rest of the line with ones.
Is there an efficient way to compute that ?
Thanks a lot.
Nico !
Accessing NumPy array elements one by one is not very efficient. You may do better with just plain Python lists. They also have an index method which can search for the first entry of the value in the list.
from numpy import *
a = array([[1, 0, 1, 0, 1, 1, 0, 0],
[0, 0, 1, 1, 0, 1, 1, 0],
[1, 0, 1, 1, 0, 0, 0, 1],
[0, 1, 0, 0, 1, 0, 1, 0],
[1, 1, 1, 0, 1, 0, 0, 1],
[1, 0, 0, 0, 0, 1, 0, 1],
[0, 0, 0, 0, 1, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 1, 0, 1, 1, 0, 1]])
def idx_front(ln):
    try:
        return list(ln).index(1)
    except ValueError:
        return len(ln) # an index beyond line end
def idx_back(ln):
    try:
        return len(ln) - list(reversed(ln)).index(1) - 1
    except ValueError:
        return len(ln) # an index beyond line end
ranges = [ (idx_front(ln), idx_back(ln)) for ln in a ]
for ln, (lo,hi) in zip(a, ranges):
    ln[lo:hi] = 1 # attention: destructive update in-place
print("ranges =", ranges)
print(a)
Output:
ranges = [(0, 5), (2, 6), (0, 7), (1, 6), (0, 7), (0, 7), (4, 4), (8, 8), (2, 7)]
[[1 1 1 1 1 1 0 0]
[0 0 1 1 1 1 1 0]
[1 1 1 1 1 1 1 1]
[0 1 1 1 1 1 1 0]
[1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1]
[0 0 0 0 1 0 0 0]
[0 0 0 0 0 0 0 0]
[0 0 1 1 1 1 1 1]]
Actually, this is a basic binary image morphology operation.
You can do it in one step for the entire 3D array using scipy.ndimage.binary_fill_holes (older SciPy releases also expose it as scipy.ndimage.morphology.binary_fill_holes).
You just need a slightly different structuring element. In a nutshell, you want a structuring element that looks like this for the 2D case:
[[0, 0, 0],
[1, 1, 1],
[0, 0, 0]]
Here's a quick example:
import numpy as np
import scipy.ndimage as ndimage
a = np.array( [[1, 0, 1, 0, 1, 1, 0, 0],
[0, 0, 1, 1, 0, 1, 1, 0],
[1, 0, 1, 1, 0, 0, 0, 1],
[0, 1, 0, 0, 1, 0, 1, 0],
[1, 1, 1, 0, 1, 0, 0, 1],
[1, 0, 0, 0, 0, 1, 0, 1],
[0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 0],
[0, 0, 1, 0, 1, 1, 0, 1]])
structure = np.zeros((3,3), dtype=int)  # np.int was removed from NumPy; plain int works
structure[1,:] = 1
filled = ndimage.binary_fill_holes(a, structure)
print(filled.astype(int))
This yields:
[[1 1 1 1 1 1 0 0]
[0 0 1 1 1 1 1 0]
[1 1 1 1 1 1 1 1]
[0 1 1 1 1 1 1 0]
[1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1]
[0 0 0 0 0 0 0 0]
[0 0 0 0 1 0 0 0]
[0 0 1 1 1 1 1 1]]
The real advantage to this (Other than speed... It will be much faster and more memory efficient than using lists!) is that it will work just as well for 3D, 4D, 5D, etc arrays.
We just need to adjust the structuring element to match the number of dimensions.
import numpy as np
import scipy.ndimage as ndimage
# Generate some random 3D data to match what we want...
x = (np.random.random((10,10,20)) + 0.5).astype(int)
# Make the structure (I'm assuming that "z" is the _last_ dimension!)
structure = np.zeros((3,3,3))
structure[1,:,1] = 1
filled = ndimage.binary_fill_holes(x, structure)
print(x[:,:,5])
print(filled[:,:,5].astype(int))
Here's a slice from the random input 3D array:
[[1 0 1 0 1 1 0 1 0 0]
[1 0 1 1 0 1 0 1 0 0]
[1 0 0 1 0 1 1 1 1 0]
[0 0 0 1 1 0 1 0 0 0]
[1 0 1 0 1 0 0 1 1 0]
[1 0 1 1 0 1 0 0 0 1]
[0 1 0 1 0 0 1 0 1 0]
[0 1 1 0 1 0 0 0 0 1]
[0 0 0 1 1 1 1 1 0 1]
[1 0 1 1 1 1 0 0 0 1]]
And here's the filled version:
[[1 1 1 1 1 1 1 1 0 0]
[1 1 1 1 1 1 1 1 0 0]
[1 1 1 1 1 1 1 1 1 0]
[0 0 0 1 1 1 1 0 0 0]
[1 1 1 1 1 1 1 1 1 0]
[1 1 1 1 1 1 1 1 1 1]
[0 1 1 1 1 1 1 1 1 0]
[0 1 1 1 1 1 1 1 1 1]
[0 0 0 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1 1]]
The key difference here is that we did this for every slice of the entire 3D array in one step.
After a moment's thought, following your description and the corner case of all-zero rows, this is still quite straightforward with NumPy:
In []: A
Out[]:
array([[1, 0, 1, 0, 1, 1, 0, 0],
[0, 0, 1, 1, 0, 1, 1, 0],
[1, 0, 1, 1, 0, 0, 0, 1],
[0, 1, 0, 0, 1, 0, 1, 0],
[1, 1, 1, 0, 1, 0, 0, 1],
[1, 0, 0, 0, 0, 1, 0, 1],
[0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 0],
[0, 0, 1, 0, 1, 1, 0, 1]])
In []: v= 0< A.sum(1) # work only with rows at least one 1
In []: A_v= A[v, :]
In []: (r, s), a= A_v.nonzero(), arange(v.sum())
In []: se= c_[searchsorted(r, a), searchsorted(r, a, side= 'right')- 1]
In []: for k in a: A_v[k, s[se[k, 0]]: s[se[k, 1]]]= 1
..:
In []: A[v, :]= A_v
In []: A
Out[]:
array([[1, 1, 1, 1, 1, 1, 0, 0],
[0, 0, 1, 1, 1, 1, 1, 0],
[1, 1, 1, 1, 1, 1, 1, 1],
[0, 1, 1, 1, 1, 1, 1, 0],
[1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1],
[0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 0],
[0, 0, 1, 1, 1, 1, 1, 1]])
Update:
After some more tinkering, here is a more 'pythonic' implementation that is much simpler than the one above. The following lines:
for k in range(A.shape[0]):
    m= A[k].nonzero()[0]
    try: A[k, m[0]: m[-1]]= 1
    except IndexError: continue
are quite straightforward ones. And they'll perform very well, indeed.
I can't think of a more efficient way than what you describe. For every line:
1. Scan the line from the left until you find a 1.
2. If no 1 is found, continue with the next line.
3. Otherwise, scan from the right to find the last 1 in the line.
4. Fill everything in the current line between the positions from steps 1 and 3 with 1s.
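The scan described above can also be sketched without an explicit Python loop, using cumulative logical ORs from both ends of each line. The function name fill_between and the axis parameter are illustrative; the sample array is the slice from the question.

```python
import numpy as np

def fill_between(a, axis=-1):
    """Fill every line along `axis` with 1s between its first and last 1."""
    a = np.asarray(a).astype(bool)
    # True from the first 1 onwards, scanning in the forward direction.
    from_left = np.logical_or.accumulate(a, axis=axis)
    # True up to the last 1, obtained by scanning the flipped line.
    from_right = np.flip(
        np.logical_or.accumulate(np.flip(a, axis=axis), axis=axis), axis=axis
    )
    return (from_left & from_right).astype(int)

a = np.array([[1, 0, 1, 0, 1, 1, 0, 0],
              [0, 0, 1, 1, 0, 1, 1, 0],
              [1, 0, 1, 1, 0, 0, 0, 1],
              [0, 0, 0, 0, 0, 0, 0, 0],
              [1, 1, 1, 0, 1, 0, 0, 1],
              [1, 0, 0, 0, 0, 1, 0, 1],
              [0, 0, 0, 0, 1, 0, 0, 0],
              [0, 0, 1, 0, 1, 1, 0, 1]])
print(fill_between(a))
```

Because the operation works along a single axis, the same call should apply unchanged to a 3D array: pick the axis that corresponds to x, and all z slices are processed at once.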
