Numpy compare 2 array shape, if different, append 0 to match shape - python

I am comparing two NumPy arrays and want to add them together. Before doing so, I need to make sure they are the same size. If the sizes differ, I want to take the smaller array and pad its last rows with zeros to match the shape of the larger one.
Both arrays have 16 columns and N rows. I assume this should be pretty straightforward, but I can't get my head around it. So far I am able to compare the shapes of the two arrays.
import csv
import numpy as np
import sys
data = np.genfromtxt('./test1.csv', dtype=float, delimiter=',')
data_sys = np.genfromtxt('./test2.csv', dtype=float, delimiter=',')
print(data.shape)
print(data_sys.shape)
if data.shape != data_sys.shape:
    print("we have an error")
This is the output I got:
=============New file.csv============
(603, 16)
(604, 16)
we have an error
I want to fill the last row of the "data" array with 0 so that I can add the two arrays.
Thanks for your help.

You can use np.vstack((array1, array2)) from numpy, which stacks arrays vertically (note that the arrays are passed as a single tuple). For example:
A = np.random.randint(2, size = (2, 16))
B = np.random.randint(2, size = (5, 16))
print(A.shape)
print(B.shape)
if A.shape[0] < B.shape[0]:
    A = np.vstack((A, np.zeros((B.shape[0] - A.shape[0], 16))))
elif A.shape[0] > B.shape[0]:
    B = np.vstack((B, np.zeros((A.shape[0] - B.shape[0], 16))))
print(A.shape)
print(A)
In your case:
if data.shape[0] < data_sys.shape[0]:
    data = np.vstack((data, np.zeros((data_sys.shape[0] - data.shape[0], 16))))
elif data.shape[0] > data_sys.shape[0]:
    data_sys = np.vstack((data_sys, np.zeros((data.shape[0] - data_sys.shape[0], 16))))
I assume that your matrices always have the same number of columns; if not, you can similarly use hstack to stack them horizontally.
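As an alternative to stacking a block of zeros by hand, the same row padding can be sketched with np.pad. This is only an illustration, not the answerer's method: the helper name pad_rows_to_match and the sample arrays are made up for this example, and it assumes both arrays are 2D with the same number of columns.

```python
import numpy as np

def pad_rows_to_match(a, b, fill_value=0):
    """Pad the shorter of two 2D arrays with rows of fill_value (hypothetical helper)."""
    n_rows = max(a.shape[0], b.shape[0])

    def pad(x):
        missing = n_rows - x.shape[0]
        # ((0, missing), (0, 0)): add rows only at the bottom, no extra columns
        return np.pad(x, ((0, missing), (0, 0)),
                      mode="constant", constant_values=fill_value)

    return pad(a), pad(b)

# Stand-ins for the two CSV arrays
a = np.ones((3, 16))
b = np.ones((5, 16))
a_padded, b_padded = pad_rows_to_match(a, b)
total = a_padded + b_padded
print(a_padded.shape, b_padded.shape)
```

Because the helper pads whichever array is shorter, no if/elif on the shapes is needed at the call site.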

If you have only two files, and their shapes differ in just the 0th dimension, a simple check and copy is probably easiest, though it lacks generality:
import numpy as np
data = np.genfromtxt('./test1.csv', dtype=float, delimiter=',')
data_sys = np.genfromtxt('./test2.csv', dtype=float, delimiter=',')
fill_value = 0 # could be np.nan or something else instead
if data.shape[0]>data_sys.shape[0]:
    temp = data_sys
    data_sys = np.ones(data.shape)*fill_value
    data_sys[:temp.shape[0],:] = temp
elif data.shape[0]<data_sys.shape[0]:
    temp = data
    data = np.ones(data_sys.shape)*fill_value
    data[:temp.shape[0],:] = temp
print('Using conditional:')
print(data.shape)
print(data_sys.shape)
if data.shape != data_sys.shape:
    print("we have an error")
A much more general solution is a custom class--overkill for your two files but much easier if you have lots of files to handle. The basic idea is that static class variables sx and sy keep track of the largest widths and heights, and are used when get_data is called, to output a standard shape array. This is pre-filled with your desired fill value, and the actual data from the corresponding file are copied into the upper left corner of the standard shape array:
import numpy as np
class IsomorphicArray:
    sy = 0 # static class variable
    sx = 0 # static class variable
    fill_value = 0.0
    def __init__(self,csv_filename):
        self.data = np.genfromtxt(csv_filename,dtype=float,delimiter=',')
        self.instance_sy,self.instance_sx = self.data.shape
        if self.instance_sy>IsomorphicArray.sy:
            IsomorphicArray.sy = self.instance_sy
        if self.instance_sx>IsomorphicArray.sx:
            IsomorphicArray.sx = self.instance_sx
    def get_data(self):
        out = np.ones((IsomorphicArray.sy,IsomorphicArray.sx))*self.fill_value
        out[:self.instance_sy,:self.instance_sx] = self.data
        return out
isomorphic_array_list = []
for filename in ['./test1.csv','./test2.csv']:
    isomorphic_array_list.append(IsomorphicArray(filename))
numpy_array_list = []
for isomorphic_array in isomorphic_array_list:
    numpy_array_list.append(isomorphic_array.get_data())
print('Using custom class:')
for numpy_array in numpy_array_list:
    print(numpy_array.shape)

Assuming both arrays have 16 columns:
len1=len(data)
len2=len(data_sys)
if len1<len2:
    data=np.append(data, np.zeros((len2-len1, 16)),axis=0)
elif len2<len1:
    data_sys=np.append(data_sys, np.zeros((len1-len2, 16)),axis=0)
print(data.shape)
print(data_sys.shape)
if data.shape != data_sys.shape:
    print("we have an error")
else:
    print("we r good")

NumPy provides an append function to add values to an array (see the numpy.append documentation for details). For multi-dimensional arrays, the axis argument defines how the values are added. Since you already know which of your arrays is the smaller one, create a zero-filled array of the missing rows with numpy.zeros and append it to your target array.
Without an axis argument, append flattens both arrays, so you would have to reshape the result afterwards.

I had a similar situation: two arrays, mask_in of shape (n1, m1) and mask_ot of shape (n2, m2), that were generated by masking a 2D image of size (N, M), where mask_ot is larger than mask_in and both share a common center (X0, Y0). I followed the approach suggested by @AniaG using vstack and hstack. I obtained the shapes of both arrays, computed the size difference, and finally added the missing elements at both ends.
Here is what I got:
mask_in = np.random.randint(2, size = (2, 8))
mask_ot = np.random.randint(2, size = (6, 16))
mask_in_amp = mask_in
dif_row = mask_ot.shape[0]-mask_in_amp.shape[0]
dif_col = mask_ot.shape[1]-mask_in_amp.shape[1]
complete_row = dif_row // 2 # integer division; assumes the difference is even
complete_col = dif_col // 2
mask_in_amp = np.vstack((mask_in_amp, np.zeros((complete_row, mask_in_amp.shape[1]))))
mask_in_amp = np.vstack((np.zeros((complete_row, mask_in_amp.shape[1])), mask_in_amp))
mask_in_amp = np.hstack((mask_in_amp, np.zeros((mask_in_amp.shape[0],complete_col))))
mask_in_amp = np.hstack((np.zeros((mask_in_amp.shape[0],complete_col)), mask_in_amp))
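The centered padding above can also be sketched with a single np.pad call, which additionally copes with an odd size difference by putting the extra row or column at the end. The helper name pad_centered is hypothetical, and placing the leftover element at the end is an arbitrary choice made for this example.

```python
import numpy as np

def pad_centered(small, target_shape, fill_value=0):
    """Pad `small` on both sides of every axis to reach target_shape (hypothetical helper)."""
    pad_width = []
    for have, want in zip(small.shape, target_shape):
        missing = want - have
        if missing < 0:
            raise ValueError("target shape must be at least as large as the input")
        before = missing // 2
        # any odd leftover element goes after the data
        pad_width.append((before, missing - before))
    return np.pad(small, pad_width, mode="constant", constant_values=fill_value)

mask_in = np.ones((2, 8), dtype=int)
padded = pad_centered(mask_in, (6, 16))
print(padded.shape)
```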

If you don't care about the exact shapes of two arrays you can also do the following:
if data.size == data_sys.size:
    print('arrays have the same number of elements, and possibly the same shape')
else:
    print('arrays do not have the same shape for sure')

Related

Slicing 2D numpy array periodically

I have a numpy array of 300x300 where I want to keep all elements periodically. Specifically, for both axes I want to keep the first 5 elements, then discard 15, keep 5, discard 15, etc. This should result in an array of 75x75 elements. How can this be done?
You can create a 1D mask that carries out the keep/discard function, then repeat the mask and apply it to the array. Here is an example.
import numpy as np
size = 300
array = np.arange(size).reshape((size, 1)) * np.arange(size).reshape((1, size))
mask = np.concatenate((np.ones(5), np.zeros(15))).astype(bool)
period = len(mask)
mask = np.repeat(mask.reshape((1, period)), repeats=size // period, axis=0)
mask = np.concatenate(mask, axis=0)
result = array[mask][:, mask]
print(result.shape)
You can view the array as series of 20x20 blocks, of which you want to keep the upper-left 5x5 portion. Let's say you have
keep = 5
discard = 15
This only works if
assert all(s % (keep + discard) == 0 for s in arr.shape)
First compute the shape of the view and use it:
block = keep + discard
shape1 = (arr.shape[0] // block, block, arr.shape[1] // block, block)
view = arr.reshape(shape1)[:, :keep, :, :keep]
The following operation will create a copy of the data because the view creates a non-contiguous buffer:
shape2 = (shape1[0] * keep, shape1[2] * keep)
result = view.reshape(shape2)
You can compute shape1 and shape2 in a more general manner with something like
shape1 = tuple(
    np.stack((np.array(arr.shape) // block,
              np.full(arr.ndim, block)), -1).ravel())
shape2 = tuple(np.array(shape1[::2]) * keep)
I would recommend packaging this into a function.
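One possible way to package the reshape approach into a function is sketched here. The name keep_periodic and its error handling are assumptions made for this example; it works for any number of dimensions as long as every axis length is a multiple of keep + discard.

```python
import numpy as np

def keep_periodic(arr, keep=5, discard=15):
    """Keep the first `keep` of every `keep + discard` elements along each axis."""
    block = keep + discard
    if any(s % block for s in arr.shape):
        raise ValueError("every axis length must be a multiple of keep + discard")
    # Split each axis into (number of blocks, block length)
    split_shape = []
    for s in arr.shape:
        split_shape.extend((s // block, block))
    blocks = arr.reshape(split_shape)
    # Take all blocks, but only the first `keep` entries inside each block
    index = (slice(None), slice(0, keep)) * arr.ndim
    kept = blocks[index]
    # Merging the axes back copies the data, since `kept` is non-contiguous
    return kept.reshape([(s // block) * keep for s in arr.shape])

arr = np.arange(300 * 300).reshape(300, 300)
result = keep_periodic(arr)
print(result.shape)
```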
Here is my first thought of a solution. Will update later if I think of one with fewer lines. This should work even if the input is not square:
output = []
for i in range(len(arr)):
    tmp = []
    if i % (15+5) < 5: # keep first 5, then discard next 15
        for j in range(len(arr[i])):
            if j % (15+5) < 5: # keep first 5, then discard next 15
                tmp.append(arr[i,j])
        output.append(tmp) # only append rows that are kept
Update:
Building off of Yang's answer, here is another way which uses np.tile, which repeats an array a given number of times along each axis. This relies on the input array being square in dimension.
import numpy as np
# Define one instance of the keep/discard box
keep, discard = 5, 15
mask = np.concatenate([np.ones(keep), np.zeros(discard)])
mask_2d = mask.reshape((keep+discard,1)) * mask.reshape((1,keep+discard))
# Tile it out -- overshoot, then trim to match size
count = len(arr)//len(mask_2d) + 1
tiled = np.tile(mask_2d, [count,count]).astype('bool')
tiled = tiled[:len(arr), :len(arr)]
# Apply the mask to the input array
dim = sum(tiled[0])
output = arr[tiled].reshape((dim,dim))
Another option using meshgrid and a modulo:
# MyArray = 300x300 numpy array
r = np.r_[0:300] # A slide from 0->300
xv, yv = np.meshgrid(r, r) # x and y grid
mask = ((xv%20)<5) & ((yv%20)<5) # We create the boolean mask
result = MyArray[mask].reshape((75,75)) # We apply the mask and reshape the final output

How to collapse two array axis together of a numpy array?

Basic idea: I have an array of images with images.shape == (10, 28, 28, 3), i.e. 10 images of 28x28 pixels with 3 colour channels. I want to stitch them together in one long line: single_image.shape # [280, 28, 3]. What would be the best numpy based function for that?
More generally: is there a function along the lines of stitch(array, source_axis=0, target_axis=1) that would transform an array A.shape # [a0, a1, source_axis, a4, target_axis, a6] into a shape B.shape # [a0, a1, a4, target_axis*source_axis, a6] by concatenating subarrays A[:,:,i,:,:,:] along axis=target_axis
You can set it up with a single moveaxis + reshape combo -
def merge_axis(a, source_axis=0, target_axis=1):
    shp = a.shape
    L = shp[source_axis]*shp[target_axis] # merged axis len
    out_shp = np.insert(np.delete(shp,[source_axis,target_axis]),target_axis-1,L)
    return np.moveaxis(a,source_axis,target_axis-1).reshape(out_shp)
Alternatively, out_shp could be setup with array manipulations and might be easier to follow, like so -
shp = np.array(a.shape)
shp[target_axis] *= shp[source_axis]
out_shp = np.delete(shp,source_axis)
If the source and target axes are adjacent, we can skip moveaxis and simply reshape. The additional benefit is that the output is then a view into the input and hence virtually free at runtime. So, we introduce an if-conditional to check for this and modify our implementations to something like these -
def merge_axis_v1(a, source_axis=0, target_axis=1):
    shp = a.shape
    L = shp[source_axis]*shp[target_axis] # merged_axis_len
    out_shp = np.insert(np.delete(shp,[source_axis,target_axis]),target_axis-1,L)
    if target_axis==source_axis+1:
        return a.reshape(out_shp)
    else:
        return np.moveaxis(a,source_axis,target_axis-1).reshape(out_shp)
def merge_axis_v2(a, source_axis=0, target_axis=1):
    shp = np.array(a.shape)
    shp[target_axis] *= shp[source_axis]
    out_shp = np.delete(shp,source_axis)
    if target_axis==source_axis+1:
        return a.reshape(out_shp)
    else:
        return np.moveaxis(a,source_axis,target_axis-1).reshape(out_shp)
Verify views -
In [156]: a = np.random.rand(10,10,10,10,10)
In [157]: np.shares_memory(merge_axis_v1(a, source_axis=0, target_axis=1),a)
Out[157]: True
Here is my take:
def merge_axis(array, source_axis=0, target_axis=1):
    array = np.moveaxis(array, source_axis, 0)
    array = np.moveaxis(array, target_axis, 1)
    array = np.concatenate(array)
    array = np.moveaxis(array, 0, target_axis-1)
    return array
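For the concrete image case in the question, a minimal sketch with plain reshape and transpose may be enough. The array contents here are made up; only the shape (10, 28, 28, 3) comes from the question. Stacking the images vertically merges two adjacent axes, so reshape alone returns a view, while placing them side by side needs a transpose first and therefore copies.

```python
import numpy as np

# Dummy data with the shape from the question: 10 images, 28x28 pixels, 3 channels
images = np.arange(10 * 28 * 28 * 3).reshape(10, 28, 28, 3)

# Vertical strip: merge axis 0 into axis 1 (adjacent axes, so this is a view)
single_image = images.reshape(-1, 28, 3)

# Horizontal strip: bring the image axis next to the column axis, then merge
side_by_side = images.transpose(1, 0, 2, 3).reshape(28, -1, 3)

print(single_image.shape, side_by_side.shape)
```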

How to vectorize a code with python numpy.bincount, using apply along axis

I'm trying to vectorize some code with numpy, to run it using multiprocessing, but I can't understand how numpy.apply_along_axis works. This is an example of the code, vectorized using map:
import numpy
from scipy import sparse
import multiprocessing
from matplotlib import pyplot
#first i build a matrix of some x positions vs time datas in a sparse format
matrix = numpy.random.randint(2, size = 100).astype(float).reshape(10,10)
x = numpy.nonzero(matrix)[0]
times = numpy.nonzero(matrix)[1]
weights = numpy.random.rand(x.size)
#then i define an array of y positions
nStepsY = 5
y = numpy.arange(1,nStepsY+1)
#now i build an image using x-y-times coordinates and x-times weights
def mapIt(ithStep):
    ncolumns = 80
    image = numpy.zeros(ncolumns)
    yTimed = y[ithStep]*times
    positions = (numpy.round(x-yTimed)+50).astype(int)
    values = numpy.bincount(positions,weights)
    values = values[numpy.nonzero(values)]
    positions = numpy.unique(positions)
    image[positions] = values
    return image
image = list(map(mapIt, range(nStepsY)))
image = numpy.array(image)
a = pyplot.imshow(image, aspect = 10)
Here the output plot
I tried to use numpy.apply_along_axis, but this function allows me to iterate only along the rows of image, while i need to iterate along the ithStep index too. E.g.:
#now i build an image using x-y-times coordinates and x-times weights
nrows = nStepsY
ncolumns = 80
matrix = numpy.zeros(nrows*ncolumns).reshape(nrows,ncolumns)
def applyIt(image):
    image = numpy.zeros(ncolumns)
    yTimed = y[ithStep]*times
    positions = (numpy.round(x-yTimed)+50).astype(int)
    values = numpy.bincount(positions,weights)
    values = values[numpy.nonzero(values)]
    positions = numpy.unique(positions)
    image[positions] = values
    return image
imageApplied = numpy.apply_along_axis(applyIt,1,matrix)
a = pyplot.imshow(imageApplied, aspect = 10)
It obviously returns only the first row nrows times, since nothing iterates ithStep:
And here is the wrong plot
Is there a way to iterate an index, or to use an index while numpy.apply_along_axis iterates?
Here is the code using only matrix operations: it is considerably faster than map or apply_along_axis, but it uses much more memory.
(In this function I use a trick with scipy.sparse, which behaves more intuitively than numpy arrays when several values have to be summed into the same element.)
def fullmatrix(nRows, nColumns):
    y = numpy.arange(1,nStepsY+1)
    image = numpy.zeros((nRows, nColumns))
    yTimed = numpy.outer(y,times)
    x3d = numpy.outer(numpy.ones(nStepsY),x)
    weights3d = numpy.outer(numpy.ones(nStepsY),weights)
    y3d = numpy.outer(y,numpy.ones(x.size))
    positions = (numpy.round(x3d-yTimed)+50).astype(int)
    matrix = sparse.coo_matrix((numpy.ravel(weights3d), (numpy.ravel(y3d), numpy.ravel(positions)))).todense()
    return matrix
image = fullmatrix(nStepsY, 80)
a = pyplot.imshow(image, aspect = 10)
This way is simpler and very fast! Thank you so much.
nStepsY = 5
nRows = nStepsY
nColumns = 80
y = numpy.arange(1,nStepsY+1)
image = numpy.zeros((nRows, nColumns))
fakeRow = numpy.zeros(x.size) # positions is not defined yet here; it has the same size as x
def itermatrix(ithStep):
    yTimed = y[ithStep]*times
    positions = (numpy.round(x-yTimed)+50).astype(int)
    matrix = sparse.coo_matrix((weights, (fakeRow, positions))).todense()
    matrix = numpy.ravel(matrix)
    missColumns = (nColumns-matrix.size)
    zeros = numpy.zeros(missColumns)
    matrix = numpy.concatenate((matrix, zeros))
    return matrix
for i in numpy.arange(nStepsY):
    image[i] = itermatrix(i)
#or, without initialization of image:
imageMapped = list(map(itermatrix, range(nStepsY)))
imageMapped = numpy.array(imageMapped)
It feels like attempting to use map or apply_along_axis is obscuring the essentially iterative nature of the problem.
I rewrote your code as an explicit loop on y:
nStepsY = 5
y = numpy.arange(1,nStepsY+1)
image = numpy.zeros((nStepsY, 80))
for i, yi in enumerate(y):
    yTimed = yi*times
    positions = (numpy.round(x-yTimed)+50).astype(int)
    values = numpy.bincount(positions,weights)
    values = values[numpy.nonzero(values)]
    positions = numpy.unique(positions)
    image[i, positions] = values
a = pyplot.imshow(image, aspect = 10)
pyplot.show()
Looking at the code, I think I could calculate positions for all y values making a (y.shape[0],times.shape[0]) array. But the rest, the bincount and unique still have to work row by row.
apply_along_axis when working with a 2d array, and axis=1 essentially does:
res = np.zeros_like(arr)
for i in range....:
    res[i,:] = func1d(arr[i,:])
If the input array has more dimensions it constructs a more elaborate indexing object [i,j,k,:]. And it can handle cases where func1d returns a different size array than the input. But in any case it is just a generalized iteration tool.
Moving the initial positions creation outside the loop:
yTimed = y[:,None]*times
positions = (numpy.round(x-yTimed)+50).astype(int)
image = numpy.zeros((positions.shape[0], 80))
for i, pos in enumerate(positions):
    values = numpy.bincount(pos,weights)
    values = values[numpy.nonzero(values)]
    pos = numpy.unique(pos)
    image[i, pos] = values
Now I can cast this as an apply_along_axis problem, with an applyIt that takes a positions vector (with all the yTimed information) rather than blank image vector.
def applyIt(pos, size, weights):
    acolumn = numpy.zeros(size)
    values = numpy.bincount(pos,weights)
    values = values[numpy.nonzero(values)]
    pos = numpy.unique(pos)
    acolumn[pos] = values
    return acolumn
image = numpy.apply_along_axis(applyIt, 1, positions, 80, weights)
Timing wise I expect it's a bit slower than my explicit iteration. It has to do more setup work, including a test call applyIt(positions[0,:],...) to determine the size of its return array (i.e image has different shape than positions.)
def csrmatrix(y, times, x, weights):
    yTimed = numpy.outer(y,times)
    n=y.shape[0]
    x3d = numpy.outer(numpy.ones(n),x)
    weights3d = numpy.outer(numpy.ones(n),weights)
    y3d = numpy.outer(y,numpy.ones(x.size))
    positions = (numpy.round(x3d-yTimed)+50).astype(int)
    #print(y.shape, weights3d.shape, y3d.shape, positions.shape)
    matrix = sparse.csr_matrix((numpy.ravel(weights3d), (numpy.ravel(y3d), numpy.ravel(positions))))
    #print(repr(matrix))
    return matrix
# one call
image = csrmatrix(y, times, x, weights)
# iterative call
alist = []
for yi in numpy.arange(1,nStepsY+1):
    alist.append(csrmatrix(numpy.array([yi]), times, x, weights))
def mystack(alist):
    # concatenate without offset
    row, col, data = [],[],[]
    for A in alist:
        A = A.tocoo()
        row.extend(A.row)
        col.extend(A.col)
        data.extend(A.data)
    print(len(row),len(col),len(data))
    return sparse.csr_matrix((data, (row, col)))
vimage = mystack(alist)
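The row-by-row bincount can also be collapsed into a single bincount call by offsetting each row's positions into one flat index. This is a different technique from the answers above, offered only as a sketch: the variable names n_steps_y, n_columns and flat_index are made up, the input data is random, and it assumes all positions fall inside [0, n_columns).

```python
import numpy as np

# Random stand-in for the question's sparse x/times/weights data
rng = np.random.default_rng(0)
matrix = rng.integers(0, 2, size=(10, 10))
x, times = np.nonzero(matrix)
weights = rng.random(x.size) + 0.1  # strictly positive weights

n_steps_y, n_columns = 5, 80
y = np.arange(1, n_steps_y + 1)

# One row of positions per y value: shape (n_steps_y, x.size)
positions = (np.round(x - y[:, None] * times) + 50).astype(int)

# Shift row i by i * n_columns so every (row, column) pair gets a unique bin
rows = np.repeat(np.arange(n_steps_y), x.size)
flat_index = rows * n_columns + positions.ravel()

image = np.bincount(flat_index,
                    weights=np.tile(weights, n_steps_y),
                    minlength=n_steps_y * n_columns).reshape(n_steps_y, n_columns)
print(image.shape)
```

Passing minlength guarantees the flat result has exactly n_steps_y * n_columns bins, so the final reshape always succeeds.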

getting elements in an array1 that are not in array2

Main Problem
What is the better/pythonic way of retrieving elements in a particular array that are not found in a different array. This is what I have;
idata = [np.column_stack(data[k]) for k in range(len(data)) if data[k] not in final]
idata = np.vstack(idata)
My interest is in performance. My data is an (X,Y,Z) array of size (7000 x 3) and my gdata is an (X,Y) array of (11000 x 2)
Preamble
I am working on an octant search to find the n-number(e.g. 8) of points (+) closest to my circular point (o) in each octant. This would mean that my points (+) are reduced to only 64 (8 per octant). Then for each gdata I would save the elements that are not found in data.
import tkinter as tk
from tkinter import filedialog
import pandas as pd
import numpy as np
from scipy.spatial.distance import cdist
from collections import defaultdict
root = tk.Tk()
root.withdraw()
file_path = filedialog.askopenfilename()
data = pd.read_excel(file_path)
data = np.array(data, dtype=float) # np.float was removed in NumPy 1.24
nrow, cols = data.shape
file_path1 = filedialog.askopenfilename()
gdata = pd.read_excel(file_path1)
gdata = np.array(gdata, dtype=float)
gnrow, gcols = gdata.shape
npoint_per_octant = 8 # N: number of points to keep per octant
delta = gdata - data[:,:2]
angles = np.arctan2(delta[:,1], delta[:,0])
bins = np.linspace(-np.pi, np.pi, 9)
bins[-1] = np.inf # handle edge case
octantsort = []
for j in range(gnrow):
    delta = gdata[j, ::] - data[:, :2]
    angles = np.arctan2(delta[:, 1], delta[:, 0])
    octantsort = []
    for i in range(8):
        data_i = data[(bins[i] <= angles) & (angles < bins[i+1])]
        if data_i.size > 0:
            dist_order = np.argsort(cdist(data_i[:, :2], gdata[j, ::][np.newaxis]), axis=0)
            if dist_order.size < npoint_per_octant+1:
                [octantsort.append(data_i[dist_order[:npoint_per_octant][j]]) for j in range(dist_order.size)]
            else:
                [octantsort.append(data_i[dist_order[:npoint_per_octant][j]]) for j in range(npoint_per_octant)]
        final = np.vstack(octantsort)
    idata = [np.column_stack(data[k]) for k in range(len(data)) if data[k] not in final]
    idata = np.vstack(idata)
Is there an efficient and pythonic way to increase performance in the last two lines of the code?
If I understand your code correctly, then I see the following potential savings:
dedent the final = ... line
don't use arctan; it's expensive. Since you only want octants, compare the coordinates to zero and to each other
don't do a full argsort, use argpartition instead
make your octantsort an "octantargsort", i.e. store the indices into data, not the data points themselves; this would save you the search in the last but one line and allow you to use np.delete for removing
don't use append inside a list comprehension. This will produce a list of Nones that is immediately discarded. You can use list.extend outside the comprehension instead
besides, these list comprehensions look like a convoluted way of converting data_i[dist_order[:npoint_per_octant]] into a list, why not simply cast, or even keep as an array, since you want to vstack in the end?
Here is some sample code illustrating these ideas:
import numpy as np
def discard_nearest_in_each_octant(eater, eaten, n_eaten_p_eater):
    # build octants
    # start with quadrants ...
    top, left = (eaten < eater).T
    quadrants = [np.where(v&h)[0] for v in (top, ~top) for h in (left, ~left)]
    dcoord2 = (eaten - eater)**2
    dc2quadrant = [dcoord2[q] for q in quadrants]
    # ... and split them
    oct4158 = [q[:, 0] < q [:, 1] for q in dc2quadrant]
    # main loop
    dc2octants = [[q[o], q[~o]] for q, o in zip (dc2quadrant, oct4158)]
    reloap = [[
        np.argpartition(o.sum(-1), n_eaten_p_eater)[:n_eaten_p_eater]
        if o.shape[0] > n_eaten_p_eater else None
        for o in opair] for opair in dc2octants]
    # translate indices
    octantargpartition = [q[so] if oap is None else q[np.where(so)[0][oap]]
                          for q, o, oaps in zip(quadrants, oct4158, reloap)
                          for so, oap in zip([o, ~o], oaps)]
    octantargpartition = np.concatenate(octantargpartition)
    return np.delete(eaten, octantargpartition, axis=0)
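Two of the suggestions, using argpartition instead of a full argsort and storing indices so that np.delete can remove the selected points, can be illustrated in isolation. This toy sketch ignores the octant logic entirely; the names points, center and k and the random data are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.random((100, 2))   # stand-in for the (X, Y) columns of data
center = np.array([0.5, 0.5])   # stand-in for one gdata point
k = 8

# Squared distances are enough for ranking, so no sqrt is needed
dist2 = ((points - center) ** 2).sum(axis=1)

# Indices of the k nearest points, in no particular order
nearest = np.argpartition(dist2, k)[:k]

# Remove them by index instead of searching for matching rows
remaining = np.delete(points, nearest, axis=0)
print(remaining.shape)
```

Keeping indices rather than copies of the rows is what makes the final "not in" search unnecessary.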

Full Frequency Array Reconstruction after numpy.fft.rfftn

I have a real valued grayscale 3D image with resolution rows x cols x deps. I take the dft of the image using freq = numpy.fft.rfftn(myImage)
The returned array, freq, has shape rows x cols x (deps//2 + 1). I want to reconstruct freq as if it were the output of numpy.fft.fftn(myImage), that is, I want the dimensions of freq to be rows x cols x deps.
I know that the correspondence for the real-valued dft is X_(k1,k2,k3) = X*_(N1-k1,N2-k2,N3-k3), where * denotes the complex conjugate and the indices are taken modulo N1, N2, N3.
I could reconstruct the full freq array using a loop, but that would be too slow, and I'm having trouble figuring out the correct way of doing it with array slicing.
Thanks!
FYI, I need the full array because I'll be element wise multiplying it with another array of full size rows x cols x deps, I cannot assume that array has any structure (like symmetry) that would make it unnecessary for me to reconstruct the full freq array.
I got it!
import numpy as np
import time
rows = 181
cols = 217
deps = 181
jac_k = np.random.rand(rows, cols, deps)*5
prev = time.time()
fft1 = np.fft.fftn(jac_k)
print(time.time() - prev)
prev = time.time()
fft2 = np.fft.rfftn(jac_k)
if deps%2 == 0:
    fft2Star = np.conj(fft2[:, :, -2:0:-1])
else:
    fft2Star = np.conj(fft2[:, :, -1:0:-1])
fft2Star[1::, :, :] = fft2Star[:0:-1, :, :]
fft2Star[:, 1::, :] = fft2Star[:, :0:-1, :]
fft2 = np.concatenate( (fft2, fft2Star), axis=2)
print(time.time() - prev)
print(np.linalg.norm(fft1 - fft2))
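The same reconstruction can be written as a small function that works for any number of dimensions and for both even and odd last-axis lengths. This is a sketch rather than the poster's code: the name full_from_rfftn is hypothetical, and it relies on the fact that flipping an axis and rolling it by one maps index k to (-k) mod N.

```python
import numpy as np

def full_from_rfftn(half, n_last):
    """Rebuild the fftn spectrum of a real array from its rfftn output.

    half   : output of np.fft.rfftn
    n_last : length of the last axis of the original real array
    """
    # Frequencies along the last axis that rfftn does not store
    k_missing = np.arange(half.shape[-1], n_last)
    # X[k1, k2, k3] = conj(X[-k1, -k2, N3 - k3]); handle the last axis first
    mirrored = np.conj(half[..., n_last - k_missing])
    # Negate the index on every other axis: flip, then roll by one
    for axis in range(half.ndim - 1):
        mirrored = np.roll(np.flip(mirrored, axis=axis), 1, axis=axis)
    return np.concatenate((half, mirrored), axis=-1)

image = np.random.default_rng(0).random((6, 7, 9))
full = full_from_rfftn(np.fft.rfftn(image), image.shape[-1])
print(np.linalg.norm(full - np.fft.fftn(image)))
```

The original length of the last axis has to be passed in because it cannot be recovered from the half spectrum alone: lengths 2m and 2m + 1 both give m + 1 stored frequencies.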
