I have the following task to solve.
I have an image (numpy array) where everything that is not the main object is 0 and the main object has some pixel counts all around (let's set all of them to 1).
What I need is to get the number of all the pixels on the contour (red squares with 1 as the value) of this object. The objects can have different forms.
Is there any way to achieve it?
OBS: The goal is to have a method that would be able to adapt to the shape of the figure, because it would be run on multiple images simultaneously.
I propose a solution similar to @user2640045's, using convolution.
We can slide a filter over the array that counts the number of neighbours (left, right, top, bottom):
import numpy as np
from scipy import signal
a = np.array(
[
[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 0, 0, 0],
[0, 0, 1, 1, 1, 0, 0],
[0, 1, 1, 1, 1, 1, 0],
[0, 0, 1, 1, 1, 0, 0],
[0, 0, 0, 1, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0],
]
)
filter = np.array([[0, 1, 0],
[1, 0, 1],
[0, 1, 0]])
Now we convolve the image array with the filter:
conv = signal.convolve2d(a, filter, mode='same')
Every element that has more than zero and less than four neighbors while being active itself is a boundary element:
bounds = (a > 0) & (conv > 0) & (conv < 4)  # boolean mask, so a[bounds] selects pixels
We can apply this mask to get the boundary pixels and sum them up:
>>> a[bounds].sum()
8
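As a cross-check on the filter-based count above, here is a minimal sketch of an alternative that uses morphological erosion instead of convolution. The function name count_boundary_pixels is made up for illustration. It assumes a 2-D array, 4-connectivity, and that everything outside the array counts as background. Unlike the conv > 0 condition above, it also counts an isolated pixel as boundary.

```python
import numpy as np
from scipy import ndimage


def count_boundary_pixels(image):
    """Count object pixels with at least one 4-connected background neighbour.

    Hypothetical helper, not part of the original answer.
    """
    mask = np.asarray(image) > 0
    # With no structure argument, binary_erosion uses the 4-connected cross,
    # and pixels outside the array are treated as background (border_value=0).
    interior = ndimage.binary_erosion(mask)
    # Boundary pixels are object pixels that do not survive the erosion.
    return int(np.count_nonzero(mask & ~interior))


a = np.array(
    [
        [0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 1, 0, 0, 0],
        [0, 0, 1, 1, 1, 0, 0],
        [0, 1, 1, 1, 1, 1, 0],
        [0, 0, 1, 1, 1, 0, 0],
        [0, 0, 0, 1, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0],
    ]
)
print(count_boundary_pixels(a))  # 8
```

Because it works on a boolean mask, this counts pixels rather than summing their values, which is the same thing only when all object pixels are set to 1.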
This is interesting, and I have an elegant solution for you.
Since we can agree that a contour pixel is an array value that is greater than 0 and has at least 1 neighbor with a value of 0, we can solve it in a pretty straightforward way that works for any image you get (as a NumPy array, of course...)
import numpy as np
image_pxs = np.array([[0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 0, 0, 0],
[0, 0, 1, 1, 1, 0, 0],
[0, 1, 1, 1, 1, 1, 0],
[0, 0, 1, 1, 1, 0, 0],
[0, 0, 0, 1, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0]])
def get_contour(two_d_arr):
    contour_pxs = 0
    # Iterate over every pixel of the array:
    for i, row in enumerate(two_d_arr):
        for j, pixel in enumerate(row):
            # Check whether each neighbor is empty (outside the array counts as empty)
            up = two_d_arr[i-1][j] == 0 if i > 0 else True
            down = two_d_arr[i+1][j] == 0 if i < len(two_d_arr)-1 else True
            left = two_d_arr[i][j-1] == 0 if j > 0 else True
            right = two_d_arr[i][j+1] == 0 if j < len(row)-1 else True
            # If at least 1 neighbor is empty and current value > 0 it is the contour
            if (up or down or left or right) and pixel > 0:
                # Add the value of the pixel at i, j
                contour_pxs += pixel
    return contour_pxs
print(get_contour(image_pxs))
The output is of course 8:
8
[Finished in 97ms]
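Since the question mentions running this on many images, the same rule as the loop above (value greater than 0 and at least one empty neighbor) can also be written without Python-level loops. This is only a sketch: the name contour_sum is hypothetical, and it assumes a 2-D array where everything outside the array counts as empty.

```python
import numpy as np


def contour_sum(two_d_arr):
    """Sum the values of pixels that are > 0 and have an empty 4-neighbour.

    Hypothetical vectorized variant of the loop-based approach.
    """
    arr = np.asarray(two_d_arr)
    # Pad with zeros so pixels on the array edge see an "empty" neighbour.
    padded = np.pad(arr, 1, mode="constant", constant_values=0)
    up = padded[:-2, 1:-1] == 0      # neighbour in the row above
    down = padded[2:, 1:-1] == 0     # neighbour in the row below
    left = padded[1:-1, :-2] == 0    # neighbour in the column to the left
    right = padded[1:-1, 2:] == 0    # neighbour in the column to the right
    on_contour = (arr > 0) & (up | down | left | right)
    return arr[on_contour].sum()


image_pxs = np.array([[0, 0, 0, 0, 0, 0, 0],
                      [0, 0, 0, 1, 0, 0, 0],
                      [0, 0, 1, 1, 1, 0, 0],
                      [0, 1, 1, 1, 1, 1, 0],
                      [0, 0, 1, 1, 1, 0, 0],
                      [0, 0, 0, 1, 0, 0, 0],
                      [0, 0, 0, 0, 0, 0, 0]])
print(contour_sum(image_pxs))  # 8
```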
I have a Python matrix array for example like this one:
a = array([[0, 2, 1, 1.4142, 4, 7],
[3, 0, 1.4142, 9, 2, 0],
[1.4142, 0, 0, 1, 1, 3]])
I want to set to 0 every element of this array that is neither 1 nor sqrt(2) (1.4142). That is:
a = array([[0, 0, 1, 1.4142, 0, 0],
[0, 0, 1.4142, 0, 0, 0],
[1.4142, 0, 0, 1, 1, 0]])
I have tried this
a[(a != 1).any() or not (np.isclose(a, np.sqrt(2))).any()] = 0
and some variations, but I can't make it work. Thx.
Just use masking -
m1 = np.isclose(a,1) # use a==1 for exact matches
m2 = np.isclose(a,np.sqrt(2))
a[~(m1 | m2)] = 0
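For reference, here is a self-contained sketch of the masking idea on the array from the question. The helper name keep_only and the atol value are assumptions chosen for illustration: 1.4142 differs from np.sqrt(2) by about 1.4e-5, which is only just inside np.isclose's default tolerance, so an explicit absolute tolerance makes the intent clearer. Note also that the original attempt fails because .any() collapses each comparison to a single True/False instead of an element-wise mask.

```python
import numpy as np

a = np.array([[0, 2, 1, 1.4142, 4, 7],
              [3, 0, 1.4142, 9, 2, 0],
              [1.4142, 0, 0, 1, 1, 3]])


def keep_only(values, targets, atol=1e-4):
    """Return a copy of values with everything not close to a target set to 0.

    Hypothetical helper; atol is an assumed tolerance, adjust it to your data.
    """
    result = np.array(values, dtype=float)  # work on a copy
    keep = np.zeros(result.shape, dtype=bool)
    for target in targets:
        # Element-wise comparison with an explicit absolute tolerance.
        keep |= np.isclose(result, target, rtol=0.0, atol=atol)
    result[~keep] = 0
    return result


cleaned = keep_only(a, [1, np.sqrt(2)])
print(cleaned)
```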
You can also try np.where, which returns a new array instead of modifying a in place:
np.where((a == 1.4142), a, a == 1)
Why not check the sum and product of the elements of both arrays? Correct me if I am wrong, but this should work for positive numbers.
This is my code:
def string2bin(s):
    y = []
    for x in s:
        q = []
        q.append(str(bin(ord(x))[2:].zfill(8)))
        y.append(q)
    return y
It is supposed to output:
string2bin('abc')
[[0, 1, 1, 0, 0, 0, 0, 1], [0, 1, 1, 0, 0, 0, 1, 0],
[0, 1, 1, 0, 0, 0, 1, 1]]
But instead outputs:
string2bin('abc')
[['01100001'], ['01100010'], ['01100011']]
Also how do you segment a string?
Thanks for any help!
Convert the string to a list of digits.
def string2bin(s):
    y = []
    for x in s:
        # zfill (not fill) pads the binary string with leading zeros to 8 digits
        q = [int(c) for c in bin(ord(x))[2:].zfill(8)]
        y.append(q)
    return y
Probably the easiest edit to your code would be to add the digits to the list one by one instead of appending the whole string. bin() returns a string, so append adds it as a single element, not as individual values.
def string2bin(s):
    y = []
    for x in s:
        q = []
        # extend adds each digit separately; int() turns '0'/'1' into 0/1
        q.extend(int(c) for c in bin(ord(x))[2:].zfill(8))
        y.append(q)
    return y
Try this:
def string2bin(s):
    return [list(map(int, str(bin(ord(x)))[2:].zfill(8))) for x in s]
print(string2bin('abc'))
It produces (up to indentation):
[[0, 1, 1, 0, 0, 0, 0, 1],
[0, 1, 1, 0, 0, 0, 1, 0],
[0, 1, 1, 0, 0, 0, 1, 1]]
You just missed the map(int, ...) and list(...) part.
So you're taking 1 char at a time out of the string then converting it to bin, removing the 0b and then filling in the rest with 0's to keep the length at 8.
The issue is that this gives you a string to append, not a list of digits.
Here's my solution, not the most Pythonic but it's easy to read:
def string2bin(s):
    y = []
    for x in s:
        q = []
        adjusted = list(str(bin(ord(x)))[2:].zfill(8))
        for num in adjusted:
            q.append(int(num))
        y.append(q)
    return y
print(string2bin('abc'))
This outputs exactly what you've requested:
[[0, 1, 1, 0, 0, 0, 0, 1], [0, 1, 1, 0, 0, 0, 1, 0], [0, 1, 1, 0, 0, 0, 1, 1]]
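The answers above all build the 8-digit string with bin(...)[2:].zfill(8). As a small sketch of an equivalent spelling, format(value, "08b") produces the same zero-padded binary string in one step. This assumes every character has a code point below 256; larger code points would give more than 8 digits. On the side question of segmenting a string: iterating over a string (or calling list on it) yields its characters, and slicing yields substrings.

```python
def string2bin(s):
    """Return, for each character of s, its code point as a list of 8 bits."""
    # format(n, "08b") is the binary form of n, zero-padded to width 8.
    return [[int(bit) for bit in format(ord(ch), "08b")] for ch in s]


print(string2bin("abc"))
# [[0, 1, 1, 0, 0, 0, 0, 1], [0, 1, 1, 0, 0, 0, 1, 0], [0, 1, 1, 0, 0, 0, 1, 1]]

# Segmenting a string: characters via list(), substrings via slicing.
chars = list("abc")      # ['a', 'b', 'c']
first_two = "abc"[0:2]   # 'ab'
```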
I have a sparse matrix (numpy.array) and I would like to have the index of the nonzero elements in it.
In Matlab I would write:
[i, j] = find(CM)
and in Python what should I do?
I have tried numpy.nonzero (but I don't know how to take the indices from that) and flatnonzero (but it's not convenient for me, I need both the row and column index).
Thanks in advance!
Assuming that by "sparse matrix" you don't actually mean a scipy.sparse matrix, but merely a numpy.ndarray with relatively few nonzero entries, then I think nonzero is exactly what you're looking for. Starting from an array:
>>> a = (np.random.random((5,5)) < 0.10)*1
>>> a
array([[0, 0, 0, 0, 0],
[0, 0, 0, 0, 1],
[0, 0, 1, 0, 0],
[1, 0, 0, 0, 0],
[0, 0, 0, 0, 0]])
nonzero returns the indices (here x and y) where the nonzero entries live:
>>> a.nonzero()
(array([1, 2, 3]), array([4, 2, 0]))
We can assign these to i and j:
>>> i, j = a.nonzero()
We can also use them to index back into a, which should give us only 1s:
>>> a[i,j]
array([1, 1, 1])
We can even modify a using these indices:
>>> a[i,j] = 2
>>> a
array([[0, 0, 0, 0, 0],
[0, 0, 0, 0, 2],
[0, 0, 2, 0, 0],
[2, 0, 0, 0, 0],
[0, 0, 0, 0, 0]])
If you want a combined array from the indices, you can do that too:
>>> np.array(a.nonzero()).T
array([[1, 4],
[2, 2],
[3, 0]])
(there are lots of ways to do this reshaping; I chose one almost at random.)
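One of those other ways, shown here as a short sketch on a fixed array so the result is reproducible: np.argwhere returns the (row, column) pairs directly, and is equivalent to transposing the output of np.nonzero.

```python
import numpy as np

a = np.array([[0, 0, 0, 0, 0],
              [0, 0, 0, 0, 1],
              [0, 0, 1, 0, 0],
              [1, 0, 0, 0, 0],
              [0, 0, 0, 0, 0]])

# Separate row and column index arrays, like MATLAB's [i, j] = find(CM)
# (but 0-based and in row-major order).
i, j = np.nonzero(a)

# One (row, column) pair per nonzero element.
pairs = np.argwhere(a)

print(i, j)   # [1 2 3] [4 2 0]
print(pairs)  # rows [1 4], [2 2], [3 0]
```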
This goes slightly beyond what you ask, and I only mention it because I once faced a similar problem. If you want the indices in order to access some other array, there is some very simple syntax:
import numpy as np
array = np.random.randint(0, 2, size=(3, 3))
data = np.random.random(size=(3, 3))
Now array looks something like
>>> array
array([[0, 1, 0],
[1, 0, 1],
[1, 1, 0]])
while data could be
>>> data
array([[ 0.92824816, 0.43605604, 0.16627849],
[ 0.00301434, 0.94342538, 0.95297402],
[ 0.32665135, 0.03504204, 0.86902492]])
Then if we want the elements of data at the positions where array is zero:
>>> data[array == 0]
array([ 0.92824816, 0.16627849, 0.94342538, 0.86902492])
Which is nice and simple.
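To tie this back to the index-based approach, here is a small sketch with fixed (non-random) arrays showing that indexing with a boolean mask selects the same elements as indexing with the arrays returned by np.nonzero. The data values are made up for illustration.

```python
import numpy as np

array = np.array([[0, 1, 0],
                  [1, 0, 1],
                  [1, 1, 0]])
data = np.arange(9, dtype=float).reshape(3, 3)  # 0.0 .. 8.0, illustrative values

# Boolean-mask indexing: pick the elements of data where array is zero.
by_mask = data[array == 0]

# The same selection through explicit row/column indices.
rows, cols = np.nonzero(array == 0)
by_index = data[rows, cols]

print(by_mask)   # [0. 2. 4. 8.]
print(by_index)  # [0. 2. 4. 8.]
```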
I'm working with a very large sparse matrix multiplication (matmul) problem. As an example let's say:
A is a binary ( 75 x 200,000 ) matrix. It's sparse, so I'm using csc for storage. I need to do the following matmul operation:
B = A.transpose() * A
The output is going to be a sparse and symmetric matrix of size 200Kx200K.
Unfortunately, B is going to be way too large to store in RAM (or "in core") on my laptop. On the other hand, I'm lucky because there are some properties of B that should solve this problem.
Since B is going to be symmetric along the diagonal and sparse, I could use a triangular matrix (upper/lower) to store the results of the matmul operation and a sparse matrix storage format could further reduce the size.
My question is...can numpy or scipy be told, ahead of time, what the output storage requirements are going to look like so that I can select a storage solution using numpy and avoid the "matrix is too big" runtime error after several minutes (hours) of calculation?
In other words, can storage requirements for the matrix multiply be approximated by analyzing the contents of the two input matrices using an approximate counting algorithm?
https://en.wikipedia.org/wiki/Approximate_counting_algorithm
If not, I'm looking into a brute force solution. Something involving map/reduce, out-of-core storage, or a matmul subdivision solution (Strassen's algorithm) from the following web links:
A couple Map/Reduce problem subdivision solutions
http://www.norstad.org/matrix-multiply/index.html
http://bpgergo.blogspot.com/2011/08/matrix-multiplication-in-python.html
An out-of-core (PyTables) storage solution
Very large matrices using Python and NumPy
A matmul subdivision solution:
https://en.wikipedia.org/wiki/Strassen_algorithm
http://facultyfp.salisbury.edu/taanastasio/COSC490/Fall03/Lectures/FoxMM/example.pdf
http://eli.thegreenplace.net/2012/01/16/python-parallelizing-cpu-bound-tasks-with-multiprocessing/
Thanks in advance for any recommendations, comments, or guidance!
Since you are after the product of a matrix with its transpose, the value at [m, n] is basically going to be the dot product of columns m and n in your original matrix.
I am going to use the following matrix as a toy example
a = np.array([[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1],
[0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]])
>>> np.dot(a.T, a)
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0],
[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0],
[0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 2]])
It is of shape (3, 12) and has 7 non-zero entries. The product of its transpose with it is of course of shape (12, 12) and has 16 non-zero entries, 6 of them on the diagonal, so storing one triangle only requires 11 elements (the 6 diagonal entries plus half of the 10 off-diagonal ones).
You can get a good idea of what the size of your output matrix is going to be in one of two ways:
CSR FORMAT
If your original matrix has C non-zero columns, your new matrix will have at most C**2 non-zero entries, of which C are in the diagonal, and are assured not to be zero, and of the remaining entries you only need to keep half, so that is at most (C**2 + C) / 2 non-zero elements. Of course, many of these will also be zero, so this is probably a gross overestimate.
If your matrix is stored in csr format, then the indices attribute of the corresponding scipy object has an array with the column indices of all non zero elements, so you can easily compute the above estimate as:
>>> a_csr = scipy.sparse.csr_matrix(a)
>>> a_csr.indices
array([ 2, 11, 1, 7, 10, 4, 11])
>>> np.unique(a_csr.indices).shape[0]
6
So there are 6 columns with non-zero entries, and so the estimate would be for at most 36 non-zero entries, way more than the real 16.
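The column-based bound above can be wrapped in a small function. This is a sketch only: the name column_bound is made up, and it assumes the sparse matrix holds no explicitly stored zeros (calling eliminate_zeros() first would remove them if it might).

```python
import numpy as np
from scipy import sparse


def column_bound(matrix):
    """Upper bounds on the non-zeros of A.T * A from the number of used columns.

    Returns (bound for the full product, bound for one stored triangle).
    Hypothetical helper for illustration.
    """
    csr = sparse.csr_matrix(matrix)
    # In CSR format, .indices holds the column index of every stored element.
    used_columns = np.unique(csr.indices).shape[0]
    full = used_columns ** 2
    triangular = (used_columns ** 2 + used_columns) // 2
    return full, triangular


a = np.array([[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1],
              [0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0],
              [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]])
print(column_bound(a))  # (36, 21)
```

For the toy matrix the triangular bound is 21, against the 11 elements actually needed, which illustrates how loose this estimate can be.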
CSC FORMAT
If instead of column indices of non-zero elements we have row indices, we can actually do a better estimate. For the dot product of two columns to be non-zero, they must have a non-zero element in the same row. If there are R non-zero elements in a given row, they will contribute R**2 non-zero elements to the product. When you sum this for all rows, you are bound to count some elements more than once, so this is also an upper bound.
The row indices of the non-zero elements of your matrix are in the indices attribute of a sparse csc matrix, so this estimate can be computed as follows:
>>> a_csc = scipy.sparse.csc_matrix(a)
>>> a_csc.indices
array([1, 0, 2, 1, 1, 0, 2])
>>> rows, where = np.unique(a_csc.indices, return_inverse=True)
>>> where = np.bincount(where)
>>> rows
array([0, 1, 2])
>>> where
array([2, 3, 2])
>>> np.sum(where**2)
17
This is darn close to the real 16! And it is no coincidence that this estimate is the same as:
>>> np.sum(np.dot(a.T,a),axis=None)
17
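The reason the two numbers agree: for a binary matrix, the sum of all entries of A.T * A equals the sum over rows of (number of non-zeros in that row) squared, since each row with R ones adds an R x R block of ones to the product. Below is a sketch of the same row-based estimate using np.bincount directly on the row indices; the name row_bound is hypothetical, and the equality with the sum of the product is only checked for this binary toy matrix.

```python
import numpy as np
from scipy import sparse


def row_bound(matrix):
    """Upper bound on the non-zeros of A.T * A: sum of squared per-row counts.

    Hypothetical helper; assumes no explicitly stored zeros.
    """
    csc = sparse.csc_matrix(matrix)
    # In CSC format, .indices holds the row index of every stored element,
    # so bincount gives the number of non-zeros in each row.
    per_row = np.bincount(csc.indices, minlength=csc.shape[0]).astype(np.int64)
    return int(np.sum(per_row ** 2))


a = np.array([[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1],
              [0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0],
              [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]])
print(row_bound(a))            # 17
print(int(np.dot(a.T, a).sum()))  # 17, the same for a binary matrix
```

Only the index array of the input is touched, so the estimate costs memory proportional to the non-zeros of A, not of the product.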
In any case, the following code should allow you to see that the estimation is pretty good:
import numpy as np
import scipy.sparse

def estimate(a):
    a_csc = scipy.sparse.csc_matrix(a)
    _, where = np.unique(a_csc.indices, return_inverse=True)
    where = np.bincount(where)
    return np.sum(where**2)

def test(shape=(10, 1000), count=100):
    a = np.zeros(np.prod(shape), dtype=int)
    # Random positions may repeat, so fewer than `count` entries can end up set
    a[np.random.randint(np.prod(shape), size=count)] = 1
    print('a non-zero = {0}'.format(np.sum(a)))
    a = a.reshape(shape)
    print('a.T * a non-zero = {0}'.format(np.flatnonzero(np.dot(a.T, a)).shape[0]))
    print('csc estimate = {0}'.format(estimate(a)))
>>> test(count=100)
a non-zero = 100
a.T * a non-zero = 1065
csc estimate = 1072
>>> test(count=200)
a non-zero = 199
a.T * a non-zero = 4056
csc estimate = 4079
>>> test(count=50)
a non-zero = 50
a.T * a non-zero = 293
csc estimate = 294
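If an exact figure is needed rather than an estimate, one option is to compute the product in column blocks and only count the stored entries of each block's upper triangle, so the full 200K x 200K result never has to be held at once. This is a rough sketch, not a tuned implementation: the function name upper_triangle_nnz and the block_size default are assumptions, and it assumes non-negative entries so that no products cancel to zero.

```python
import numpy as np
from scipy import sparse


def upper_triangle_nnz(matrix, block_size=1000):
    """Count the non-zeros in the upper triangle of A.T * A, block by block.

    Hypothetical helper; block_size trades memory for number of iterations.
    """
    csc = sparse.csc_matrix(matrix)
    n_cols = csc.shape[1]
    total = 0
    for start in range(0, n_cols, block_size):
        stop = min(start + block_size, n_cols)
        # Rows start..stop of the product, restricted to columns >= start.
        # Entry (r, c) of this block is entry (start + r, start + c) of the
        # product, so the block's own upper triangle is the part we want.
        block = csc[:, start:stop].T @ csc[:, start:]
        total += sparse.triu(block).nnz
    return total


a = np.array([[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1],
              [0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0],
              [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]])
print(upper_triangle_nnz(a, block_size=5))  # 11
```

The same loop could write each triangular block to disk instead of only counting it, at which point the out-of-core question becomes one of choosing a storage format.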