First of all, I work with large byte arrays (>= 400x400x1000 bytes).
I wrote a small function which can insert a multidimensional array (or a fraction of one) into another by specifying an offset. This works if the embedded array is smaller than the embedding array (case A); otherwise the embedded array is truncated (case B).
case A) Inserting a 3x3 into a 5x5 matrix with offset 1,1 would look like this.
[[ 0. 0. 0. 0. 0.]
[ 0. 1. 1. 1. 0.]
[ 0. 1. 1. 1. 0.]
[ 0. 1. 1. 1. 0.]
[ 0. 0. 0. 0. 0.]]
case B) If the offsets push the embedded array beyond the bounds of the embedding matrix, the smaller array is truncated. E.g. a (-1,-1) offset results in this:
[[ 1. 1. 0. 0. 0.]
[ 1. 1. 0. 0. 0.]
[ 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0.]]
case C) Now, instead of truncating the embedded array, I want to extend the embedding array (with zeroes) whenever the embedded array is bigger than the embedding array or the offsets push it outside (as in case B). Is there a smart way to solve this with numpy or scipy?
[[ 1. 1. 1. 0. 0. 0.]
[ 1. 1. 1. 0. 0. 0.]
[ 1. 1. 1. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0.]]
Actually I work with 3D arrays, but for simplicity I wrote the example for 2D arrays. Current source:
import numpy as np
import nibabel as nib

def addAtPos(mat_bigger, mat_smaller, xyz_coor):
    size_sm_x, size_sm_y = np.shape(mat_smaller)
    size_gr_x, size_gr_y = np.shape(mat_bigger)
    start_gr_x, start_gr_y = xyz_coor
    start_sm_x, start_sm_y = 0, 0
    end_x, end_y = (start_gr_x + size_sm_x), (start_gr_y + size_sm_y)
    print(size_sm_x, size_sm_y)
    print(size_gr_x, size_gr_y)
    print(end_x, end_y)
    # clip the smaller matrix on the low side if the offset is negative
    if start_gr_x < 0:
        start_sm_x = -start_gr_x
        start_gr_x = 0
    if start_gr_y < 0:
        start_sm_y = -start_gr_y
        start_gr_y = 0
    # clip the smaller matrix on the high side if it sticks out
    if end_x > size_gr_x:
        size_sm_x = size_sm_x - (end_x - size_gr_x)
        end_x = size_gr_x
    if end_y > size_gr_y:
        size_sm_y = size_sm_y - (end_y - size_gr_y)
        end_y = size_gr_y
    # copy all or a chunk (if the offset is small/big enough) of the smaller matrix into the bigger matrix
    mat_bigger[start_gr_x:end_x, start_gr_y:end_y] = mat_smaller[start_sm_x:size_sm_x, start_sm_y:size_sm_y]
    return mat_bigger

a_gr = np.zeros([5, 5])
a_sm = np.ones([3, 3])
a_res = addAtPos(a_gr, a_sm, [-2, 1])
#print(a_gr)
print(a_res)
Actually there is an easier way to do it.
For your first example of a 3x3 array embedded into a 5x5 one, you can do it with something like:
A = np.array([[1,1,1], [1,1,1], [1,1,1]])
(N, M) = A.shape
B = np.zeros(shape=(N + 2, M + 2))
B[1:-1, 1:-1] = A
By playing with the slicing you can select a subset of A and insert it anywhere within a contiguous region of B. For your case C you can compute the enlarged shape first and paste both arrays into it, as sketched below.
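Here is a minimal sketch (the helper name insert_with_growth is mine, and I've only checked it against the examples above, but it is written dimension-agnostic so it should work for 3D as well):

import numpy as np

def insert_with_growth(big, small, offset):
    # Bounding box covering both arrays: grow below 0 and past big.shape as needed.
    offset = np.asarray(offset)
    low = np.minimum(offset, 0)                                    # extension on the low side
    high = np.maximum(offset + np.array(small.shape), np.array(big.shape))
    out = np.zeros(tuple(high - low), dtype=big.dtype)
    # Paste the embedding array, shifted by the low-side extension.
    out[tuple(slice(-l, -l + d) for l, d in zip(low, big.shape))] = big
    # Paste the embedded array at its requested offset, in the new coordinates.
    out[tuple(slice(o - l, o - l + d) for o, l, d in zip(offset, low, small.shape))] = small
    return out

print(insert_with_growth(np.zeros((5, 5)), np.ones((3, 3)), (-1, -1)))  # 6x6, ones in the top-left 3x3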
Hope it helps! ;-)
I'm trying to create an identity matrix, but it raises TypeError: only integer scalar arrays can be converted to a scalar index, and I don't know how to fix it. Please help!
Z = np.array([
[0,2,0,4,4],
[0,0,3,0,0],
[0,0,0,1,0],
[0,2,0,0,0],
[0,0,0,0,0]
])
I = np.eye(Z)
I = np.identity(Z)
Both np.eye and np.identity raise the same error.
The function np.identity() takes an integer argument, not an np.array() object. So if you want to create an identity matrix of size nxn, you need to pass the length of Z:
import numpy as np
Z = np.array([
[0,2,0,4,4],
[0,0,3,0,0],
[0,0,0,1,0],
[0,2,0,0,0],
[0,0,0,0,0]
])
I = np.identity(len(Z))
print(I)
Output:
[[1. 0. 0. 0. 0.]
[0. 1. 0. 0. 0.]
[0. 0. 1. 0. 0.]
[0. 0. 0. 1. 0.]
[0. 0. 0. 0. 1.]]
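Note that np.eye has the same requirement (pass the size, not the array) and additionally accepts a diagonal offset k:

I = np.eye(len(Z))             # same 5x5 identity as np.identity(len(Z))
I_super = np.eye(len(Z), k=1)  # ones on the first superdiagonal instead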
The objective is to concatenate a NumPy array according to a set of positions. However, I am curious whether the concatenation steps shown in the code below can be optimized further, without the need for the for loop and if-else statement?
import itertools
import numpy as np

tot_length=0.2
steps=0.1
start_val=0
repeat_perm=3
list_no = np.arange(start_val, tot_length, steps)
x, y, z = np.meshgrid(*[list_no for _ in range(3)], sparse=True)
ix = np.array(((x >= y) & (y >= z)).nonzero()).T
final_opt = list_no[ix]
final_opt[:, [0, 1]] = final_opt[:, [1, 0]]
all_result = itertools.product(range(0, ix.shape[1]), repeat=repeat_perm)
for num, num_pair in enumerate(all_result, start=1):
    for num_x, num_pair_x in enumerate(num_pair, start=0):
        if (num == 1) & (num_x == 0):
            cont_arry = final_opt[num_pair_x, :]
        else:
            cont_arry = np.concatenate((cont_arry, final_opt[num_pair_x, :]), axis=0)
final_arr = np.reshape(cont_arry, (-1, 9))
print(final_arr)
The output has shape (27, 9); only part of it is shown below:
[[0. 0. 0. 0. 0. 0. 0. 0. 0. ]
[0. 0. 0. 0. 0. 0. 0.1 0. 0. ]
[0. 0. 0. 0. 0. 0. 0.1 0.1 0. ]
[0. 0. 0. 0.1 0. 0. 0. 0. 0. ]
[0. 0. 0. 0.1 0. 0. 0.1 0. 0. ]
[0. 0. 0. 0.1 0. 0. 0.1 0.1 0. ]
[0.1 0.1 0. 0.1 0.1 0. 0.1 0.1 0. ]]
Just a heads up: cont_arry will later be multiplied element-wise with a 1D array of the same length. Knowing this, is there a way to avoid storing the full concatenation result in memory, to minimise potential memory issues? In the actual application, the worst possible parameter setting is as below:
tot_length=200
steps=0.1
start_val=0
repeat_perm=1200
I think your concatenate loop can be replaced with:
alist = []
for num, num_pair in enumerate(all_result, start=1):
    for num_x, num_pair_x in enumerate(num_pair, start=0):
        alist.append(final_opt[num_pair_x, :])
arr = np.array(alist)
# arr = np.concatenate(alist, axis=0)
# arr = np.vstack(alist)
There may be some details in this that I didn't catch; I haven't tried to test it. List append is much faster than concatenate, especially when done repeatedly. concatenate is most efficient when given a whole list of arrays to join.
Better yet, don't iterate at all; instead make use of whole-array math and indexing. I haven't tried to master your code, but a rough sketch of that direction follows.
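For example (untested against your full use case, and assuming n = ix.shape[1] as in your loop; note it still materialises all n**repeat_perm index tuples at once, so it does not remove the worst-case memory concern):

n = ix.shape[1]  # the same range your loop iterates over
# every index tuple, in the same order itertools.product produces them
idx = np.stack(np.meshgrid(*[np.arange(n)] * repeat_perm, indexing='ij'),
               axis=-1).reshape(-1, repeat_perm)
# gather the selected rows of final_opt and flatten each tuple of rows into one row
final_arr = final_opt[idx].reshape(idx.shape[0], -1)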
I have a multidimensional array of shape (500000,3,2,3); let's call it data. The data is basically 500000 sets of 3 points, each of the 3 points separated into its x and y coordinates (hence the 2). The last 3 in the shape represents different rotations of the 3 points. Now, I've got a 1d array of 500000 numbers between 0 and 2 that tells me which of the rotations I want to keep; let's call it rot_index. I would like to construct a multidimensional array of shape (500000,3,2) that only keeps the correctly rotated data points. Any ideas on how to extract the data with the correct index from the original data array? I tried something like this, but it didn't work:
data[:,:,:,rot_index]
Edit:
here is some example data (giving 10 sets of points instead of 500000)
data =
[[[[0.70846822 0.98552876 0.66736535]
[0. 0. 0. ]]
[[0.66736535 0.70846822 0.98552876]
[1.54545219 2.39798549 2.33974762]]
[[0.98552876 0.66736535 0.70846822]
[3.88519982 3.94343768 4.73773311]]]
[[[0.8132551 1.18845796 1.53004225]
[0. 0. 0. ]]
[[1.18845796 1.53004225 0.8132551 ]
[1.43211754 2.58720625 2.26386152]]
[[1.53004225 0.8132551 1.18845796]
[4.01932379 4.85106777 3.69597906]]]
[[[0.66123513 0.93651048 0.83170562]
[0. 0. 0. ]]
[[0.93651048 0.83170562 0.66123513]
[2.09747072 2.38383457 1.80188002]]
[[0.83170562 0.66123513 0.93651048]
[4.48130529 4.18571459 3.89935074]]]
[[[1.31047414 0.67740955 1.42020073]
[0. 0. 0. ]]
[[0.67740955 1.42020073 1.31047414]
[1.66061575 1.97600777 2.64656179]]
[[1.42020073 1.31047414 0.67740955]
[3.63662352 4.62256956 4.30717753]]]
[[[1.4085555 1.64177102 0.27708893]
[0. 0. 0. ]]
[[0.27708893 1.4085555 1.64177102]
[0.62154257 3.04315813 2.61848461]]
[[1.64177102 0.27708893 1.4085555 ]
[3.24002718 3.6647007 5.66164274]]]
[[[0.48080385 0.85910831 0.52342904]
[0. 0. 0. ]]
[[0.52342904 0.48080385 0.85910831]
[1.08970318 2.57102289 2.62245924]]
[[0.85910831 0.52342904 0.48080385]
[3.71216242 3.66072607 5.19348213]]]
[[[1.13610207 1.51237019 0.47256909]
[0. 0. 0. ]]
[[1.51237019 0.47256909 1.13610207]
[2.92304081 2.59328103 0.76686347]]
[[0.47256909 1.13610207 1.51237019]
[5.51632184 3.3601445 3.68990428]]]
[[[1.08397801 1.16506242 0.84703646]
[0. 0. 0. ]]
[[1.16506242 0.84703646 1.08397801]
[2.37250664 2.04419242 1.86648625]]
[[0.84703646 1.08397801 1.16506242]
[4.41669906 3.91067866 4.23899289]]]
[[[0.98734317 1.11177984 0.90283297]
[0. 0. 0. ]]
[[1.11177984 0.90283297 0.98734317]
[2.25981006 2.13666143 1.88671382]]
[[0.90283297 0.98734317 1.11177984]
[4.39647149 4.02337525 4.14652387]]]
[[[1.94118244 1.14738719 1.98251535]
[0. 0. 0. ]]
[[1.14738719 1.98251535 1.94118244]
[1.83291888 1.90183408 2.54843234]]
[[1.98251535 1.94118244 1.14738719]
[3.73475296 4.45026642 4.38135123]]]]
And here is a list of the indices I want to keep:
rot_index = np.array([1, 2, 1, 1, 1, 1, 1, 2, 1, 1])
So just as an example, if you consider
data[0,:,:,0] = [[0.70846822 0.]
[0.66736535 1.54545219]
[0.98552876 3.88519982]]
data[0,:,:,1] = [[0.98552876 0.]
[0.70846822 2.39798549]
[0.66736535 3.94343768]]
data[0,:,:,2] = [[0.66736535 0.]
[0.98552876 2.33974762]
[0.70846822 4.73773311]]
These are 3 different "rotations" of the same sample, and if we look at the first element of rot_index, it is a 1. So I only want to keep
data[0,:,:,1] = [[0.98552876 0.]
[0.70846822 2.39798549]
[0.66736535 3.94343768]]
Using numpy advanced indexing (specifically, the subtopic of combining advanced and basic indexing), this should work, where data_array is a numpy ndarray holding your data:
result = data_array[range(500000),...,rot_index]
For your sample data, this produces:
[[[0.98552876 0. ]
[0.70846822 2.39798549]
[0.66736535 3.94343768]]
[[1.53004225 0. ]
[0.8132551 2.26386152]
[1.18845796 3.69597906]]
[[0.93651048 0. ]
[0.83170562 2.38383457]
[0.66123513 4.18571459]]
[[0.67740955 0. ]
[1.42020073 1.97600777]
[1.31047414 4.62256956]]
[[1.64177102 0. ]
[1.4085555 3.04315813]
[0.27708893 3.6647007 ]]
[[0.85910831 0. ]
[0.48080385 2.57102289]
[0.52342904 3.66072607]]
[[1.51237019 0. ]
[0.47256909 2.59328103]
[1.13610207 3.3601445 ]]
[[0.84703646 0. ]
[1.08397801 1.86648625]
[1.16506242 4.23899289]]
[[1.11177984 0. ]
[0.90283297 2.13666143]
[0.98734317 4.02337525]]
[[1.14738719 0. ]
[1.98251535 1.90183408]
[1.94118244 4.45026642]]]
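If you would rather not build the explicit range index, np.take_along_axis (available since numpy 1.15) gives the same result:

result = np.take_along_axis(data_array,
                            rot_index[:, None, None, None],  # broadcast over the point and coordinate axes
                            axis=-1)[..., 0]                 # drop the now length-1 rotation axis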
I'm very new to GPU programming and pyCUDA and have a pretty fundamental gap in my knowledge. I have spent quite a bit of time searching SO, looking at example code and reading supporting documentation for CUDA/pyCUDA but haven't found much diversity in the explanations and can't get my head around a few things.
I am having trouble correctly defining block and grid dimensions. The code I am currently running is as follows, and aims to do element-wise multiplication of an array a by a float b:
from __future__ import division
import pycuda.gpuarray as gpuarray
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy as np
rows = 256
cols = 10
a = np.ones((rows, cols), dtype=np.float32)
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)
b = np.float32(2)
mod = SourceModule("""
__global__ void MatMult(float *a, float b)
{
    const int i = threadIdx.x + blockDim.x * blockIdx.x;
    const int j = threadIdx.y + blockDim.y * blockIdx.y;
    int Idx = i + j*gridDim.x;
    a[Idx] *= b;
}
""")
func = mod.get_function("MatMult")
xBlock = np.int32(np.floor(1024/rows))
yBlock = np.int32(cols)
bdim = (xBlock, yBlock, 1)
dx, mx = divmod(rows, bdim[0])
dy, my = divmod(cols, bdim[1])
gdim = ( (dx + (mx>0)) * bdim[0], (dy + (my>0)) * bdim[1])
print "bdim=",bdim, ", gdim=", gdim
func(a_gpu, b, block=bdim, grid=gdim)
a_doubled = np.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu)
print a_doubled - 2*a
The code should print the block dimensions bdim and the grid dimensions gdim, as well as an array of zeroes.
This works for small array sizes, for example, if rows=256 and cols=10 (as in the example above) the output is as follows:
bdim= (4, 10, 1) , gdim= (256, 10)
[[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
...,
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]]
However, if I increase rows=512, I get the following output:
bdim= (2, 10, 1) , gdim= (512, 10)
[[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
...,
[ 2. 2. 2. ..., 2. 2. 2.]
[ 2. 2. 2. ..., 2. 2. 2.]
[ 2. 2. 2. ..., 2. 2. 2.]]
This indicates that the multiplication is happening twice for some elements of the array.
However, if I force the block dimensions to bdim = (1,1,1), the problem no longer occurs and I get the following (correct) output for the larger array size:
bdim= (1, 1, 1) , gdim= (512, 10)
[[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
...,
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]]
I don't understand this. What is happening here that makes this method of defining the block and grid dimensions no longer appropriate as the array size increases? Also, if the block has dimensions (1,1,1), does this mean that the calculation is performed serially?
Thanks in advance for any pointers and help!
You operate on a 2D grid of 2D blocks. In your kernel you seem to assume that gridDim.x returns the number of threads in the x dimension of the grid.
__global__ void MatMult(float *a, float b)
{
    const int i = threadIdx.x + blockDim.x * blockIdx.x;
    const int j = threadIdx.y + blockDim.y * blockIdx.y;
    int Idx = i + j*gridDim.x;
    a[Idx] *= b;
}
gridDim.x returns the number of blocks in the x direction of the grid, not the number of threads. To obtain the number of threads in a given direction, you should multiply the number of threads per block by the number of blocks in the grid in the same direction:
int Idx = i + j * blockDim.x * gridDim.x;
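For reference, here is the kernel with just that one line changed:

mod = SourceModule("""
__global__ void MatMult(float *a, float b)
{
    const int i = threadIdx.x + blockDim.x * blockIdx.x;
    const int j = threadIdx.y + blockDim.y * blockIdx.y;
    // blockDim.x * gridDim.x is the total number of threads in x
    int Idx = i + j * blockDim.x * gridDim.x;
    a[Idx] *= b;
}
""")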
I found an answer here, but it is not clear whether I should reshape the array. Do I need to reshape the 2D array into 1D before passing it to a pyCUDA kernel?
There is no need to reshape a 2D gpuarray in order to pass it to a CUDA kernel.
As I said in the answer you linked to, a 2D numpy or PyCUDA array is just an allocation of pitched linear memory, stored in row-major order by default. Both have two members which tell you everything you need to access an array: shape and strides. For example:
In [8]: X=np.arange(0,15).reshape((5,3))
In [9]: print X.shape
(5, 3)
In [10]: print X.strides
(12, 4)
The shape is self-explanatory; the strides are the pitch of the storage in bytes. The best practice for kernel code is to treat the pointer supplied by PyCUDA as if it were allocated using cudaMallocPitch, and to treat the first element of strides as the byte pitch of the rows in memory. A trivial example might look like this:
import pycuda.driver as drv
from pycuda.compiler import SourceModule
import pycuda.autoinit
import numpy as np
mod = SourceModule("""
__global__ void diag_kernel(float *dest, int stride, int N)
{
    const int tid = threadIdx.x + blockDim.x * blockIdx.x;
    if (tid < N) {
        float* p = (float*)((char*)dest + tid*stride) + tid;
        *p = 1.0f;
    }
}
""")
diag_kernel = mod.get_function("diag_kernel")
a = np.zeros((10,10), dtype=np.float32)
a_N = np.int32(a.shape[0])
a_stride = np.int32(a.strides[0])
a_bytes = a.size * a.dtype.itemsize
a_gpu = drv.mem_alloc(a_bytes)
drv.memcpy_htod(a_gpu, a)
diag_kernel(a_gpu, a_stride, a_N, block=(32,1,1))
drv.memcpy_dtoh(a, a_gpu)
print a
Here some memory is allocated on the device, a zeroed 2D array is copied to that allocation directly, and the result of the kernel (filling the diagonal with ones) is copied back to the host and printed. It isn't necessary to flatten or otherwise modify the shape or memory layout of the 2D numpy data at any point in the process. The result is:
$ cuda-memcheck python ./gpuarray.py
========= CUDA-MEMCHECK
[[ 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]]
========= ERROR SUMMARY: 0 errors