Concatenating, sorting, and re-partitioning xyz data - python

I have a situation where I have two lists of [x, y, z] data. I want to concatenate these lists, sort them, and then extract a matrix of the z values, with x increasing along the columns and y increasing along the rows.
To give an example:
import numpy as np

list1 = np.linspace(-2, 2, 3)
list2 = np.linspace(-1, 1, 3)
dat1 = []
for x in list1:
    for y in list1:
        z = x * y
        dat1 += [[x, y, z]]
dat1 = np.array(dat1)
dat2 = []
for x in list2:
    for y in list2:
        z = x * y
        dat2 += [[x, y, z]]
dat2 = np.array(dat2)
I can build an array from the z values for each of these lists individually using:
dat1[:, 2].reshape((list1.shape[0],list1.shape[0]))
but I want an (ordered) array of all values from both lists, i.e. I want to do the same thing with the full sorted data set:
dat_full = np.vstack((dat1, dat2))
dat_index = np.lexsort((dat_full[:,1], dat_full[:,0]))
dat_sorted = dat_full[dat_index]
The problem is that this is no longer a square array, so I can't use the simple reshape trick I used previously. Is there a good way to do this?
Edit:
I should clarify that I am only interested in the unique data in the concatenated array, which can be found using:
dat_full = np.unique(np.vstack((dat1, dat2)), axis=0)
dat_index = np.lexsort((dat_full[:,1], dat_full[:,0]))
dat_sorted = dat_full[dat_index]

Like markuscosinus said, the problem with this is that you would need a "matrix" with varying row and column sizes, which cannot be done in NumPy. An alternative you may consider, however, is a masked array, if you can work with that: it lets you keep all the values in one array while marking the "gaps" as invalid. For example, you could do it like this (I have changed how you create dat1 and dat2, but the result is the same):
import numpy as np
list1 = np.linspace(-2, 2, 3)
list2 = np.linspace(-1, 1, 3)
# Evaluate using grids instead of loops
xg1, yg1 = np.meshgrid(list1, list1, indexing='ij')
x1, y1 = xg1.ravel(), yg1.ravel()
xg2, yg2 = np.meshgrid(list2, list2, indexing='ij')
x2, y2 = xg2.ravel(), yg2.ravel()
dat1 = np.stack([x1, y1, x1 * y1], axis=-1)
dat2 = np.stack([x2, y2, x2 * y2], axis=-1)
# Full dataset
dat_full = np.concatenate([dat1, dat2])
# Remove repeated rows
_, idx = np.unique(dat_full, return_index=True, axis=0)
dat_uniq = dat_full[idx]
# Find the unique X and Y values and where each row falls within them
x_uniq, x_idx = np.unique(dat_uniq[:, 0], return_inverse=True)
y_uniq, y_idx = np.unique(dat_uniq[:, 1], return_inverse=True)
# Make an array with one row per unique X and one column per unique Y
result = np.zeros((len(x_uniq), len(y_uniq)), dtype=dat_full.dtype)
# Make mask for array
mask = np.ones_like(result, dtype=bool)
# Fill array and mask
result[x_idx, y_idx] = dat_uniq[:, 2]
mask[x_idx, y_idx] = False
# Make masked array
result = np.ma.masked_array(result, mask)
print(result)
Output:
[[4.0 -- -0.0 -- -4.0]
 [-- 1.0 -0.0 -1.0 --]
 [-0.0 -0.0 0.0 0.0 0.0]
 [-- -1.0 0.0 1.0 --]
 [-4.0 -- 0.0 -- 4.0]]

My approach would be:
result = []
_, occurrences = np.unique(dat_sorted[:, 0], return_inverse=True)
for i in range(np.max(occurrences) + 1):
    result.append(dat_sorted[occurrences == i, 2])
This will give you an x-value-ordered list of y-value-ordered arrays of z values. It is not a matrix because some x values occur more often than others, resulting in arrays of different sizes.
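A tiny self-contained illustration of what this produces (toy data, not the question's):
import numpy as np

# Toy sorted (x, y, z) rows: x = 0.0 occurs twice, x = 1.0 once.
dat_sorted = np.array([[0., 0., 1.], [0., 1., 2.], [1., 0., 3.]])
result = []
_, occurrences = np.unique(dat_sorted[:, 0], return_inverse=True)
for i in range(np.max(occurrences) + 1):
    result.append(dat_sorted[occurrences == i, 2])
print(result)  # [array([1., 2.]), array([3.])] -- ragged, not a matrix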

Related

efficient way to check every value in a 2d python array

I have a 2D numpy array of values, a list of x-coordinates, and a list of y-coordinates. The x-coordinates increase left-to-right and the y-coordinates increase top-to-bottom.
For example:
a = np.random.random((3, 3))
a[0][1] = 9.0
a[0][2] = 9.0
a[1][1] = 9.0
a[1][2] = 9.0
xs = list(range(1112, 1115))
ys = list(range(1109, 1112))
Output:
[[0.48148651 9.         9.        ]
 [0.09030393 9.         9.        ]
 [0.79271224 0.83413552 0.29724989]]
[1112, 1113, 1114]
[1109, 1110, 1111]
I want to remove the values from the 2D array that are greater than 1. I also want to combine the lists xs and ys to get a list of all the coordinate pairs for points that are kept.
In this example I want to remove a[0][1], a[0][2], a[1][1], a[1][2] and I want the list of coordinate pairs to be
[[1112, 1109], [1112, 1110], [1112, 1111], [1113, 1111], [1114, 1111]]
I have been able to accomplish this using a double for loop and if statements:
a_values = []
point_pairs = []
for i in range(0, a.shape[0]):
    for j in range(0, a.shape[1]):
        if a[i][j] < 1:
            a_values.append(a[i][j])
            point_pairs.append([xs[j], ys[i]])
print(a_values)
print(point_pairs)
Output:
[0.48148650831317796, 0.09030392566133771, 0.7927122386213029, 0.8341355206494774, 0.2972498933037804]
[[1112, 1109], [1112, 1110], [1112, 1111], [1113, 1111], [1114, 1111]]
What is a more efficient way of doing this?
You can use np.nonzero to get the indices of the elements you keep:
mask = a < 1
i, j = np.nonzero(mask)
The fancy indices i and j can be used to get the elements of xs and ys directly if they are numpy arrays:
xs = np.array(xs)
ys = np.array(ys)
point_pairs = np.stack((xs[j], ys[i]), axis=-1)
You can also use np.take to make the conversion happen under the hood:
point_pairs = np.stack((np.take(xs, j), np.take(ys, i)), axis=-1)
The matching elements of a are the ones selected by the mask:
a_points = a[mask]
Alternatively:
i, j = np.nonzero(a < 1)
point_pairs = np.stack((np.take(xs, j), np.take(ys, i)), axis=-1)
a_points = a[i, j]
In this context, you can use np.where as a drop-in alias for np.nonzero.
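A quick check of that equivalence when called with a single boolean argument:
import numpy as np

mask = np.array([[True, False], [False, True]])
print(np.where(mask))    # (array([0, 1]), array([0, 1]))
print(np.nonzero(mask))  # the same tuple of index arrays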
Notes
If you are using numpy, there is rarely a need for lists. Putting xs = np.array(xs), or even just initializing it as xs = np.arange(1112, 1115) is faster and easier.
Numpy arrays should generally be indexed through a single index: a[0, 1], not a[0][1]. For your simple case, the behavior just happens to be the same, but it will not be in the general case. a[0, 1] is an index into the original array. a[0] is a view of the first row of the array, i.e., a separate array object. a[0][1] is an index into that new object. You just happened to get lucky that you are getting a view that shares the base memory, so the assignment is visible in a itself. This would not be the case if you tried a mask or fancy index, for example.
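A minimal sketch of that pitfall (hypothetical array b, not from the question):
import numpy as np

b = np.zeros((3, 3))
b[b < 1][0] = 9.0      # b[b < 1] is a copy; the write is silently lost
b[[0, 1]][1, 0] = 9.0  # same problem with a fancy index
print(b.any())         # False -- b is unchanged
b[0, 1] = 9.0          # single-index assignment modifies b itself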
On a related note, setting a rectangular swath in an array only requires one line: a[1:, :-1] = 9.
I would write your example something like this:
a = np.random.random((3, 3))
a[1:, :-1] = 9.0
xs = np.arange(1112, 1115)
ys = np.arange(1109, 1112)
i, j = np.nonzero(a < 1)
point_pairs = np.stack((xs[j], ys[i]), axis=-1)
a_points = a[i, j]

Calculate mean, variance, covariance of different length matrices in a split list

I have an array whose rows hold 5 values each: 4 data values and one index. I sort and split the array along the index. This gives me splits (matrices) of different lengths. From here on I want to calculate, for every split, the mean and variance of the fourth value and the covariance of the first 3 values. My current approach works with a for loop, which I would like to replace by matrix operations, but I am struggling with the different sizes of my matrices.
import numpy as np
A = np.random.rand(10,5)
A[:,-1] = np.random.randint(4, size=10)
sorted_A = A[np.argsort(A[:,4])]
splits = np.split(sorted_A, np.where(np.diff(sorted_A[:,4]))[0]+1)
My current for loop looks like this:
result = np.zeros((len(splits), 5))
for idx, values in enumerate(splits):
    if len(values) > 0:
        result[idx, 0] = np.mean(values[:, 3])
        result[idx, 1] = np.var(values[:, 3])
        result[idx, 2:5] = np.cov(values[:, 0:3].transpose(), ddof=0).diagonal()
    else:
        result[idx, 0] = values[:, 3]
I tried to work with masked arrays without success, since I couldn't load the matrices into the masked arrays in a proper form. Maybe someone knows how to do this or has a different suggestion.
You can use np.add.reduceat as follows:
>>> idx = np.concatenate([[0], np.where(np.diff(sorted_A[:,4]))[0]+1, [A.shape[0]]])
>>> result2 = np.empty((idx.size-1, 5))
>>> result2[:, 0] = np.add.reduceat(sorted_A[:, 3], idx[:-1]) / np.diff(idx)
>>> result2[:, 1] = np.add.reduceat(sorted_A[:, 3]**2, idx[:-1]) / np.diff(idx) - result2[:, 0]**2
>>> result2[:, 2:5] = np.add.reduceat(sorted_A[:, :3]**2, idx[:-1], axis=0) / np.diff(idx)[:, None]
>>> result2[:, 2:5] -= (np.add.reduceat(sorted_A[:, :3], idx[:-1], axis=0) / np.diff(idx)[:, None])**2
>>>
>>> np.allclose(result, result2)
True
Note that the diagonal of the covariance matrix is just the variances, which simplifies this vectorization quite a bit.
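A quick numerical check of that identity (note ddof=0 in both places):
import numpy as np

data = np.random.rand(50, 3)
# Diagonal of the covariance matrix vs. per-column variances:
assert np.allclose(np.cov(data.T, ddof=0).diagonal(), data.var(axis=0))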

Memory-efficient sparse symmetric matrix calculations

I have to perform a large number of such calculations:
X.dot(Y).dot(Xt)
X = 1 x n matrix
Y = symmetric n x n matrix, each element one of 5 values (e.g. 0, 0.25, 0.5, 0.75, 1)
Xt = n x 1 matrix, transpose of X, i.e. X[np.newaxis].T
X and Y are dense. The problem I have is for large n, I cannot store and use matrix Y due to memory issues. I am limited to using one machine, so distributed calculations are not an option.
It occurred to me that Y has 2 features which theoretically can reduce the amount of memory required to store Y:
Elements of Y are covered by a small list of values.
Y is symmetric.
How can I implement this in practice? I have looked up storage of symmetric matrices, but as far as I am aware all numpy matrix multiplications require "unpacking" the symmetry to produce a regular n x n matrix.
I understand numpy is designed for in-memory calculations, so I'm open to alternative python-based solutions not reliant on numpy. I'm also open to sacrificing speed for memory-efficiency.
UPDATE: I found that a format which crams 3 matrix elements into one byte is actually quite fast. In the example below the speed penalty is less than 2x compared to direct multiplication using @, while the space saving is more than 20x.
>>> from timeit import timeit
>>> x = np.random.random((3000,))
>>> Y = np.random.randint(0, 5, (3000, 3000), dtype=np.int8)
>>> i, j = np.triu_indices(3000, 1)
>>> Y[i, j] = Y[j, i]
>>> values = np.array([0.3, 0.5, 0.6, 0.9, 2.0])
>>> # pack 3 consecutive rows into one row of base-5 digits (max 124, fits in int8)
>>> Ycmp = (np.reshape(Y, (1000, 3, 3000)) * np.array([25, 5, 1], dtype=np.int8)[None, :, None]).sum(axis=1, dtype=np.int8)
>>>
>>> full = values[Y]
>>> x @ full @ x
1972379.8153972814
>>>
>>> vtable = values[np.transpose(np.unravel_index(np.arange(125), (5, 5, 5)))]
>>> np.dot(np.concatenate([(vtable * np.bincount(row, x, minlength=125)[:, None]).sum(axis=0) for row in Ycmp]), x)
1972379.8153972814
>>>
>>> timeit('x @ full @ x', globals=globals(), number=100)
0.7130507210385986
>>> timeit('np.dot(np.concatenate([(vtable * np.bincount(row, x, minlength=125)[:, None]).sum(axis=0) for row in Ycmp]), x)', globals=globals(), number=100)
1.3755558349657804
The solutions below are slower and less memory efficient. I'll leave them merely for reference.
If you can afford half a byte per matrix element, then you can use np.bincount like so:
>>> Y = np.random.randint(0, 5, (1000, 1000), dtype = np.int8)
>>> i, j = np.triu_indices(1000, 1)
>>> Y[i, j] = Y[j, i]
>>> values = np.array([0.3, 0.5, 0.6, 0.9, 2.0])
>>> full = values[Y]
>>> # full would correspond to your original matrix,
>>> # Y is the 'compressed' version
>>>
>>> x = np.random.random((1000,))
>>>
>>> # direct method for reference
>>> x @ full @ x
217515.13954751115
>>>
>>> # memory saving version
>>> np.dot([(values * np.bincount(row, x)).sum() for row in Y], x)
217515.13954751118
>>>
>>> # to save another almost 50% exploit symmetry
>>> upper = Y[i, j]
>>> diag = np.diagonal(Y)
>>>
>>> boundaries = np.r_[0, np.cumsum(np.arange(999, 0, -1))]
>>> (values*np.bincount(diag, x*x)).sum() + 2 * np.dot([(values*np.bincount(upper[boundaries[i]:boundaries[i+1]], x[i+1:],minlength=5)).sum() for i in range(999)], x[:-1])
217515.13954751115
Each row of Y, if represented as a numpy array of an integer dtype as suggested in @PaulPanzer's answer, can be compressed to occupy even less memory: in fact, you can store 27 base-5 elements in 64 bits, because 64 / log2(5) ≈ 27.56.
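A one-line sanity check of that bound (27 base-5 digits still fit in a signed 64-bit integer):
assert 5**27 < 2**63  # 7450580596923828125 < 9223372036854775808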
First, compress:
import numpy as np
row = np.random.randint(5, size=100)
# pad with zeros to a length that is a multiple of 27
if len(row) % 27:
    row_pad = np.append(row, np.zeros(27 - len(row) % 27, dtype=int))
else:
    row_pad = row
row_compr = []
y_compr = 0
for i, y in enumerate(row_pad):
    if i > 0 and i % 27 == 0:
        row_compr.append(y_compr)
        y_compr = 0
    y_compr *= 5
    y_compr += y
# append the last block
row_compr.append(y_compr)
row_compr = np.array(row_compr, dtype=np.int64)
Then, decompress:
row_decompr = []
for y_compr in row_compr:
    y_block = np.zeros(shape=27, dtype=np.uint8)
    for i in range(27):
        y_block[26 - i] = y_compr % 5
        y_compr = int(y_compr // 5)
    row_decompr.append(y_block)
row_decompr = np.array(row_decompr).flatten()[:len(row)]
The decompressed array coincides with the original row of Y:
assert np.allclose(row, row_decompr)
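If the Python loops become a bottleneck, the decompression can also be vectorized; a sketch under the same digit layout (most significant digit first), reusing row, row_compr and row_decompr from above:
# Divide by descending powers of 5 and take remainders, one column per digit.
powers = 5 ** np.arange(26, -1, -1, dtype=np.int64)
digits = (row_compr[:, None] // powers) % 5
assert np.array_equal(digits.ravel()[:len(row)], row_decompr)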

Vectorizing an operation between all pairs of elements in two numpy arrays

Given two arrays where each row represents a circle (x, y, r):
data = {}
data[1] = np.array([[455.108, 97.0478, 0.0122453333],
                    [403.775, 170.558, 0.0138770952],
                    [255.383, 363.815, 0.0179857619]])
data[2] = np.array([[455.103, 97.0473, 0.012041],
                    [210.19, 326.958, 0.0156912857],
                    [455.106, 97.049, 0.0150472381]])
I would like to pull out all of the pairs of circles that are not disjoint. This can be done by:
close_data = {}
for row1 in data[1]:  # loop over first array
    for row2 in data[2]:  # loop over second array
        condition = (abs(row1[0] - row2[0]) + abs(row1[1] - row2[1])) < (row1[2] + row2[2])
        if condition:  # circles overlap if true
            if tuple(row1) not in close_data:
                close_data[tuple(row1)] = [row1, row2]  # pull out close data points
            else:
                close_data[tuple(row1)].append(row2)
for k, v in close_data.items():
    print(k, v)
# desired outcome
# (455.108, 97.047799999999995, 0.012245333299999999)
# [array([4.55108000e+02, 9.70478000e+01, 1.22453333e-02]),
#  array([4.55103000e+02, 9.70473000e+01, 1.20400000e-02]),
#  array([4.55106000e+02, 9.70490000e+01, 1.50472381e-02])]
However, the nested loops over the arrays are very inefficient for large datasets. Is it possible to vectorize the calculations so that I get the advantage of using numpy?
The most difficult bit is actually getting to your representation of the info. Also note that I inserted a few squares, i.e. switched to proper Euclidean distances; if you really want your original Manhattan-style test you have to change that back.
import numpy as np
data = {}
data[1] = np.array([[455.108, 97.0478, 0.0122453333],
                    [403.775, 170.558, 0.0138770952],
                    [255.383, 363.815, 0.0179857619]])
data[2] = np.array([[455.103, 97.0473, 0.012041],
                    [210.19, 326.958, 0.0156912857],
                    [455.106, 97.049, 0.0150472381]])
d1 = data[1][:, None, :]
d2 = data[2][None, :, :]
dists2 = ((d1[..., :2] - d2[..., :2])**2).sum(axis = -1)
radss2 = (d1[..., 2] + d2[..., 2])**2
inds1, inds2 = np.where(dists2 <= radss2)
# translate to your representation:
bnds = np.r_[np.searchsorted(inds1, np.arange(3)), len(inds1)]
rows = [data[2][inds2[bnds[i]:bnds[i+1]]] for i in range(3)]
out = dict([(tuple(data[1][i]), rows[i]) for i in range(3) if rows[i].size > 0])
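If you would rather keep the question's original Manhattan-style test, only the distance lines change; a sketch reusing d1 and d2 from above:
# Manhattan-style variant of the overlap test (no squares):
dists1 = np.abs(d1[..., :2] - d2[..., :2]).sum(axis=-1)
radss1 = d1[..., 2] + d2[..., 2]
inds1, inds2 = np.where(dists1 < radss1)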
Here is a pure numpythonic way (a is data[1] and b is data[2]):
In [80]: p = np.arange(3) # for creating the indices of combinations using np.tile and np.repeat
In [81]: a = a[np.repeat(p, 3)] # creates the first column of combination array
In [82]: b = b[np.tile(p, 3)] # creates the second column of combination array
In [83]: abs(a[:, :2] - b[:, :2]).sum(1) < a[:, 2] + b[:, 2]
Out[83]: array([ True, False, True, True, False, True, True, False, True], dtype=bool)
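To recover which pairs those booleans refer to, you can reshape the flat result back to a 3 x 3 grid; a sketch continuing the session above:
In [84]: mask = (abs(a[:, :2] - b[:, :2]).sum(1) < a[:, 2] + b[:, 2]).reshape(3, 3)
In [85]: i1, i2 = np.nonzero(mask)  # i1 indexes data[1] rows, i2 indexes data[2] rows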

adding numpy arrays of differing shapes

I'd like to add two numpy arrays of different shapes, but without broadcasting; rather, the "missing" values are treated as zeros. This is probably easiest to show with an example, like
[1, 2, 3] + [2] -> [3, 2, 3]
or
[1, 2, 3] + [[2], [1]] -> [[3, 2, 3], [1, 0, 0]]
I do not know the shapes in advance.
I'm messing around with the output of np.shape for each, trying to find the smallest shape which holds both of them, embedding each in a zero-ed array of that shape and then adding them. But it seems rather a lot of work, is there an easier way?
Thanks in advance!
edit: by "a lot of work" I meant "a lot of work for me" rather than for the machine; I seek elegance rather than efficiency. My attempt at getting the smallest shape holding them both is:
def pad(a, b):
    sa, sb = map(np.shape, [a, b])
    N = np.max([len(sa), len(sb)])
    sap, sbp = map(lambda x: x + (1,)*(N - len(x)), [sa, sb])
    sp = np.amax(np.array([tuple(sap), tuple(sbp)]), 0)
not pretty :-/
I'm messing around with the output of np.shape for each, trying to find the smallest shape which holds both of them, embedding each in a zero-ed array of that shape and then adding them. But it seems rather a lot of work, is there an easier way?
Getting the np.shape is trivial, finding the smallest shape that holds both is very easy, and of course adding is trivial, so the only "a lot of work" part is the "embedding each in a zero-ed array of that shape".
And yes, you can eliminate that by just calling the ndarray.resize method. (It has to be the method: the np.resize function pads an enlarged array with repeated copies of its contents rather than with zeros.) As the docs explain:
Enlarging an array: … missing entries are filled with zeros
For example, if you know the dimensionality statically:
>>> a1 = np.array([[1, 2, 3], [4, 5, 6]])
>>> a2 = np.array([[2], [2]])
>>> shape = [max(a.shape[axis] for a in (a1, a2)) for axis in range(2)]
>>> a1.resize(shape)
>>> a2.resize(shape)
>>> a1 + a2
array([[3, 4, 3],
       [4, 5, 6]])
Note, however, that ndarray.resize works on the flattened data: when a2 is enlarged from (2, 1) to (2, 3), its two values end up in the first row ([[2, 2, 0], [0, 0, 0]]) rather than staying in the first column. If elements must keep their 2D positions, the zero-embedding approach below is the safer one.
This is the best I could come up with:
import numpy as np

def magic_add(*args):
    n = max(a.ndim for a in args)
    args = [a.reshape((n - a.ndim)*(1,) + a.shape) for a in args]
    shape = np.max([a.shape for a in args], 0)
    result = np.zeros(shape)
    for a in args:
        idx = tuple(slice(i) for i in a.shape)
        result[idx] += a
    return result
You can clean up the for loop a little if you know how many dimensions you expect on result, something like:
for a in args:
    i, j = a.shape
    result[:i, :j] += a
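For example, applied to the question's second example (a quick check of the sketch above):
print(magic_add(np.array([1, 2, 3]), np.array([[2], [1]])))
# [[3. 2. 3.]
#  [1. 0. 0.]]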
You may try my solution: 1-dimensional arrays have to be expanded to 2 dimensions (as shown in the example below) before being passed to the function.
import numpy as np
import timeit

matrix1 = np.array([[0, 10],
                    [1, 20],
                    [2, 30]])
matrix2 = np.array([[0, 10],
                    [1, 20],
                    [2, 30],
                    [3, 40]])
matrix3 = np.arange(0, 0, dtype=int)  # empty numpy array
matrix3.shape = (0, 2)                # reshape to 0 rows
matrix4 = np.array([[0, 10, 100, 1000],
                    [1, 20, 200, 2000]])
matrix5 = np.arange(0, 4000, 1)
matrix5 = np.reshape(matrix5, (4, 1000))
matrix6 = np.arange(0.0, 4000, 0.5)
matrix6 = np.reshape(matrix6, (20, 400))
matrix1 = np.array([1, 2, 3])
matrix1 = np.expand_dims(matrix1, axis=0)
matrix2 = np.array([2, 1])
matrix2 = np.expand_dims(matrix2, axis=0)
def add_2d_matrices(m1, m2, pos=(0, 0), filler=None):
    """
    Add two 2d matrices of different sizes or shapes,
    offset by xy coordinates, where x runs "from left to right" (axis 1)
    and y runs "from top to bottom" (axis 0).
    Parameters:
    - m1: first matrix
    - m2: second matrix
    - pos: tuple (x, y) containing the coordinates of the m2 offset
    - filler: gaps are filled with the value of filler (or zeros)
    Returns:
    - 2d array (float):
      containing filler values, m1 values, m2 values
      or the sum of m1 and m2 (in overlapping areas)
    Author:
    Reinhard Daemon, Austria
    """
    # determine the shape of the final array:
    _m1 = np.copy(m1)
    _m2 = np.copy(m2)
    x, y = pos
    y1, x1 = _m1.shape
    y2, x2 = _m2.shape
    xmax = max(x1, x2 + x)
    ymax = max(y1, y2 + y)
    # fill up the _m1 array with zeros:
    y1, x1 = _m1.shape
    diff = xmax - x1
    _z = np.zeros((y1, diff))
    _m1 = np.hstack((_m1, _z))
    y1, x1 = _m1.shape
    diff = ymax - y1
    _z = np.zeros((diff, x1))
    _m1 = np.vstack((_m1, _z))
    # shift the _m2 array by 'pos' and fill up with zeros:
    y2, x2 = _m2.shape
    _z = np.zeros((y2, x))
    _m2 = np.hstack((_z, _m2))
    y2, x2 = _m2.shape
    diff = xmax - x2
    _z = np.zeros((y2, diff))
    _m2 = np.hstack((_m2, _z))
    y2, x2 = _m2.shape
    _z = np.zeros((y, x2))
    _m2 = np.vstack((_z, _m2))
    y2, x2 = _m2.shape
    diff = ymax - y2
    _z = np.zeros((diff, x2))
    _m2 = np.vstack((_m2, _z))
    # add the 2 arrays:
    _m3 = _m1 + _m2
    # find and fill the "unused" positions within the summed array:
    if filler not in (None, 0, 0.0):
        y1, x1 = m1.shape
        y2, x2 = m2.shape
        x1min = 0
        x1max = x1 - 1
        y1min = 0
        y1max = y1 - 1
        x2min = x
        x2max = x + x2 - 1
        y2min = y
        y2max = y + y2 - 1
        for xx in range(xmax):
            for yy in range(ymax):
                if x1min <= xx <= x1max and y1min <= yy <= y1max:
                    continue
                if x2min <= xx <= x2max and y2min <= yy <= y2max:
                    continue
                _m3[yy, xx] = filler
    return _m3
t1 = timeit.Timer("add_2d_matrices(matrix5, matrix6, pos=(1,1), filler=111.111)",
                  "from __main__ import add_2d_matrices, matrix5, matrix6")
print("ran:", t1.timeit(number=10), "seconds")
print("\n\n")
my_res = add_2d_matrices(matrix1, matrix2, pos=(1,1), filler=99.99)
print(my_res)
