Find indices of each integer group in a labelled array - python

I have a labelled array obtained by using skimage's measure.label on a binary 2-dimensional array. For argument's sake it might look like this:
[
[1,1,0,0,2],
[1,1,1,0,2],
[1,0,0,0,0],
[0,0,0,3,3]
]
I want to get the indices of each group of labels. So in this case:
[
[(0,0),(0,1),(1,0),(1,1),(1,2),(2,0)],
[(0,4),(1,4)],
[(3,3),(3,4)]
]
I can do this using built-in Python like so (n and m are the dimensions of the array):
import itertools
import numpy as np

_dict = {}
for coords in itertools.product(range(n), range(m)):
    _dict.setdefault(labelled_array[coords], []).append(coords)
blobs = [np.array(item) for item in _dict.values()]  # note: this also collects the background label 0
This is very slow (about 10 times slower than the initial labelling of the binary array using measure.label!)
Scipy also has a function find_objects:
from scipy import ndimage
objs = ndimage.find_objects(labelled_array)
From what I can gather, though, this returns the bounding box of each group (object). I don't want the bounding box; I want the exact coordinates of each value in the group.
I have also tried using np.where for each integer in the range of labels. This is also very slow.
It also seems to me that what I'm trying to do here is something like the minesweeper algorithm. I suspect there must be an efficient solution using numpy or scipy.
Is there an efficient way to obtain these coordinates?
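One possible vectorized approach (a sketch, not from the original thread) is to argsort the flattened labels once and split the coordinate list at the label boundaries, avoiding any per-pixel Python loop:
import numpy as np

labels = labelled_array.ravel()
order = np.argsort(labels, kind='stable')                # group equal labels together
sorted_labels = labels[order]
boundaries = np.flatnonzero(np.diff(sorted_labels)) + 1  # positions where a new label starts
coords = np.column_stack(np.unravel_index(order, labelled_array.shape))
groups = np.split(coords, boundaries)                    # one (k, 2) array per label
blobs = groups[1:] if sorted_labels[0] == 0 else groups  # drop the background label 0
This does a single O(nm log nm) sort instead of nm dictionary operations.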

Related

Grouping a numpy array

I have a huge NumPy array of size 778. I would like to pair the elements, so I'm using the following code to do so.
coordinates = coordinates.reshape(-1, 2, 2)
However, if I use the following code instead, it works fine.
coordinates = coordinates[:len(coordinates)-1].reshape(-1, 2, 2)
How can I do this properly, irrespective of the size?
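If the goal is simply to drop any leftover elements so the reshape is always valid, one option (a sketch; it assumes a 1-D array that should be cut into 2x2 blocks, i.e. groups of four elements) is to trim the length down to the nearest multiple of the block size first:
import numpy as np

coordinates = np.arange(778)                     # stand-in for the real data
usable = len(coordinates) - len(coordinates) % 4
pairs = coordinates[:usable].reshape(-1, 2, 2)   # any remainder is discarded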

What is the fastest way to read in an image to an array of tuples?

I am trying to assign provinces to an area for use in a game mod. I have two separate maps for area and provinces.
provinces file, area file.
Currently I am reading in an image in Python and storing it in an array using PIL like this:
from PIL import Image
import numpy as np

land_prov_pic = Image.open(INPUT_FILES_DIR + land_prov_str)
land_prov_array = np.array(land_prov_pic)
image_size = land_prov_pic.size
for x in range(image_size[0]):
    if x % 100 == 0:
        print(x)
    for y in range(image_size[1]):
        land_prov_array[x][y] = land_prov_pic.getpixel((x,y))
Where you end up with land_prov_array[x][y] = (R,G,B)
However, this gets really slow, especially for large images. I tried reading it in using OpenCV like this:
import cv2
land_prov_array = cv2.imread(INPUT_FILES_DIR + land_prov_str)
land_prov_array = cv2.cvtColor(land_prov_array, cv2.COLOR_BGR2RGB) #Convert from BGR to RGB
But now land_prov_array[x][y] = [R G B], which is an ndarray and can't be inserted into a set. But it's way faster than the previous for loop. How do I convert [R G B] to (R,G,B) for every element in the array without for loops or, better yet, read it in that way?
EDIT: Added pictures, more description, and code blocks for readability.
It is best to convert the [R,G,B] array to a tuple when you need it to be a tuple, rather than converting the whole image to this form. An array of tuples takes up a lot more memory, and will be a lot slower to process, than a numeric array.
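For instance, a sketch of that convert-on-demand idea (pixels_of_interest is a hypothetical iterable of (x, y) coordinates; land_prov_array is the fast array from the question's cv2.imread):
seen = set()
for x, y in pixels_of_interest:               # hypothetical coordinate list
    seen.add(tuple(land_prov_array[x, y]))    # ndarray row -> hashable (R, G, B) tuple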
The answer by isCzech shows how to create a NumPy view over a 3D array that presents the data as if it were a 2D array of tuples. This might not require the additional memory of an actual array of tuples, but it is still a lot slower to process.
Most importantly, most NumPy functions (such as np.mean) and operators (such as +) cannot be applied to such an array. Thus, one is obliged to iterate over the array in Python code (or with np.vectorize), which is a lot less efficient than using NumPy functions and operators that work on the array as a whole.
For transformation from a 3D array (data3D) to a 2D array (data2D), I've used this approach:
import numpy as np
dt = np.dtype([('x', 'u1'), ('y', 'u1'), ('z', 'u1')])
data2D = data3D.view(dtype=dt).squeeze()
The .view call reinterprets the data type and still returns a 3D array, whose last dimension has size 1 and can then be removed by .squeeze. Alternatively you can use .squeeze(axis=-1) to squeeze only the last dimension (in case some of your other dimensions are of size 1 too).
Please note I've used uint8 ('u1') - your type may be different.
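For example, a quick demonstration of the resulting view (a zero-filled stand-in image is assumed):
import numpy as np

data3D = np.zeros((4, 5, 3), dtype=np.uint8)    # stand-in for an RGB image
dt = np.dtype([('x', 'u1'), ('y', 'u1'), ('z', 'u1')])
data2D = data3D.view(dtype=dt).squeeze(axis=-1)

print(data2D.shape)         # (4, 5) instead of (4, 5, 3)
print(data2D[0, 0])         # (0, 0, 0), a structured record
print(data2D[0, 0].item())  # a plain Python tuple when one is needed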
Trying to do this using a loop is very slow, indeed (compared to this approach at least).
Similar question here: Show a 2d numpy array where contents are tuples as an image

Given a 2D Numpy array representing a 2D distribution, how to sample data from this distribution with the aid of Numpy or Scipy functions?

Given a 2D numpy array dist with shape (200,200), where each entry represents the joint probability of (x1, x2) for all x1, x2 ∈ {0, 1, ..., 199}, how do I sample bivariate data x = (x1, x2) from this probability distribution with the aid of the Numpy or Scipy API?
This solution works with probability distributions of any number of dimensions, assuming the array is a valid probability distribution (non-negative entries summing to 1). It flattens the distribution, samples from that, and converts the sampled flat index back into coordinates in the original array shape.
import numpy as np

# Create a flat copy of the array
flat = array.flatten()
# Then, sample an index from the 1D array with the
# probability distribution from the original array
sample_index = np.random.choice(a=flat.size, p=flat)
# Take this index and adjust it so it matches the original array
adjusted_index = np.unravel_index(sample_index, array.shape)
print(adjusted_index)
Also, to get multiple samples, add a size keyword argument to the np.random.choice call, and modify adjusted_index before printing it:
adjusted_index = np.array(zip(*adjusted_index))
This is necessary because np.random.choice with a size argument outputs a list of indices for each coordinate dimension, so this zips them into a list of coordinate tuples. This is also much more efficient than simply repeating the first code.
Relevant documentation:
np.random.choice
np.unravel_index
Here's a way, but I'm sure there's a much more elegant solution using scipy.
numpy.random doesn't deal with 2d pmfs, so you have to do some reshaping gymnastics to go this way.
import numpy as np
# construct a toy joint pmf
dist=np.random.random(size=(200,200)) # here's your joint pmf
dist/=dist.sum() # it has to be normalized
# generate the set of all x,y pairs represented by the pmf
pairs=np.indices(dimensions=(200,200)).T # here are all of the x,y pairs
# make n random selections from the flattened pmf without replacement
# whether you want replacement depends on your application
n=50
inds=np.random.choice(np.arange(200**2),p=dist.reshape(-1),size=n,replace=False)
# inds is the set of n randomly chosen indices into the flattened dist array...
# therefore the random x,y selections
# come from selecting the associated elements
# from the flattened pairs array
selections = pairs.reshape(-1,2)[inds]
I can't comment either, but @applemonkey496's suggestion for getting multiple samples doesn't work as written. It's an excellent solution otherwise.
Instead of
adjusted_index = np.array(zip(*adjusted_index))
adjusted_index should be converted to a Python list before being put into a numpy array (np.array does not accept zip objects), e.g.:
adjusted_index = np.array(list(zip(*adjusted_index)))
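Put together, a runnable sketch of the corrected multi-sample version (a toy normalized 200x200 pmf stands in for the real distribution):
import numpy as np

array = np.random.random((200, 200))
array /= array.sum()                   # must be a valid pmf

flat = array.flatten()
sample_index = np.random.choice(a=flat.size, p=flat, size=10)
adjusted_index = np.unravel_index(sample_index, array.shape)
samples = np.array(list(zip(*adjusted_index)))   # list(...) is the fix; shape (10, 2)
print(samples)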
I can't comment, but to improve kevinkayaks' answer:
pairs=np.indices(dimensions=(200,200)).T
selections = pairs.reshape(-1,2)[inds]
are not needed and can be replaced by:
np.array([inds//m, inds%m]).T
where m = 200 is the number of columns of dist. The "pairs" matrix is then not needed anymore.

How to generate a number of random vectors starting from a given one

I have an array of values and would like to create a matrix from it, where each row is my starting vector multiplied by a sample from a (normal) distribution.
The number of rows of this matrix will then vary depending on the number of samples I want.
%pylab
my_vec = array([1,2,3])
my_rand_vec = my_vec*randn(100)
The last command does not work because the array shapes do not match.
I could think of using a for loop, but I am trying to leverage array operations.
Try this
my_rand_vec = my_vec[None,:]*randn(100)[:,None]
For small numbers I get for example
import numpy as np
my_vec = np.array([1,2,3])
my_rand_vec = my_vec[None,:]*np.random.randn(5)[:,None]
my_rand_vec
# array([[ 0.45422416, 0.90844831, 1.36267247],
# [-0.80639766, -1.61279531, -2.41919297],
# [ 0.34203295, 0.6840659 , 1.02609885],
# [-0.55246431, -1.10492863, -1.65739294],
# [-0.83023829, -1.66047658, -2.49071486]])
Your solution my_vec*randn(100) does not work because * performs element-wise multiplication, which only works if both arrays have identical shapes.
What you have to do is add an additional dimension using [None,:] and [:,None] so that numpy's broadcasting works.
As a side note, I would recommend not using pylab; instead, use import ... as ... to include modules, as pointed out here.
It is the outer product of vectors:
my_rand_vec = numpy.outer(randn(100), my_vec)
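A small check of that equivalence (a sketch with the names from the question):
import numpy as np

my_vec = np.array([1, 2, 3])
samples = np.random.randn(100)
# row i of the outer product is samples[i] * my_vec, i.e. the starting
# vector scaled by one draw from the normal distribution
my_rand_vec = np.outer(samples, my_vec)
assert my_rand_vec.shape == (100, 3)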
You can pass the dimensions of the array you require to numpy.random.randn:
my_rand_vec = my_vec*np.random.randn(100,3)
To multiply each vector by the same random number, you need to add an extra axis:
my_rand_vec = my_vec*np.random.randn(100)[:,np.newaxis]

Fastest way to Iterate a Matrix with vectors as entries in numpy

I'm using a function in Python's OpenCV library to get the optical flow movement of my hand as I move it around. Specifically http://docs.opencv.org/modules/video/doc/motion_analysis_and_object_tracking.html#calcopticalflowfarneback
This function outputs a numpy array
flow = cv2.calcOpticalFlowFarneback(prevgray, gray, 0.5, 3, 15, 3, 5, 1.2, 0)
print(flow.shape)  # prints (480, 320, 2)
So flow is a matrix with each entry a vector. I want a way to quantify this matrix, so I thought of using the L1 matrix norm (numpy.linalg.norm(flow, 1)), which throws an "improper dimensions to norm" error.
I'm thinking of getting around this by calculating the Euclidean norm of every vector and then taking the L1 norm of the matrix of those vector lengths.
I'm having trouble iterating through the flow matrix efficiently. I have done it using two for loops, going first through columns and then rows, but it's way too slow.
import numpy

r, c, d = flow.shape
flowprime = numpy.zeros((r, c), flow.dtype)
for i in range(0, r):
    for j in range(0, c):
        flowprime[i, j] = numpy.linalg.norm(flow[i, j], 2)
print(numpy.linalg.norm(flowprime, 1))
I have also tried using numpy.nditer, but
for x in numpy.nditer(flow, op_flags=['readwrite']):
    print(x)
just prints a single value rather than a vector.
What would be the fastest way to iterate through a numpy matrix with vectors as entries, norm them and then take the L1 norm?
As of numpy version 1.9, norm takes an axis argument.
Aside from that, say what you want ideally, and almost surely you can ask numpy to do it. E.g., assuming no complex entries or missing values, the simplest case is np.sqrt((flow**2).sum()), or the case I think you describe, np.linalg.norm(np.sqrt((flow**2).sum(axis=-1)), 1).
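For instance, a sketch using the axis argument (NumPy >= 1.9 and the question's flow array are assumed):
import numpy as np

flowprime = np.linalg.norm(flow, ord=2, axis=-1)  # per-pixel Euclidean norm, shape (480, 320)
result = np.linalg.norm(flowprime, ord=1)         # L1 matrix norm (max absolute column sum)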
