A simple question on CUDA threads in numba - python

It's a very beginner-oriented question. I have been working with regular Python threads and C threads, and I learned that I can create threads that run a specific function and synchronize them with semaphores and other primitives.
But now I am trying to learn CUDA using Numba's Python-based compiler. I have written the following code.
from numba import cuda
import numpy as np

@cuda.jit
def image_saturate(data):
    pos_x, pos_y = cuda.grid(2)
    if (pos_x, pos_y) <= data.shape:
        data[pos_x, pos_y] = 1

if __name__ == "__main__":
    image_quality = (128, 72)
    image = np.zeros(image_quality)
    thread_size = 32
    block_size = image_quality
    image_saturate[block_size, thread_size](image)
    print(image)
But the thing that feels weird is that I can change thread_size as I want and the result is the same - meaning the output is all ones, as expected. But the moment I change block_size, weird things start happening and only that much of the original matrix gets filled with ones - so it's only a partial filling.
From this I understand that cuda.grid(2) returns the block coordinates. But shouldn't I be able to get the actual thread coordinates and the block coordinates as well?
I am terribly new and I can't find resources online to learn from. It would be great if anyone could answer my question and also point me to resources for learning CUDA with Numba.

From this I understand that cuda.grid(2) returns the block coordinates.
That's not the case. That statement returns a fully-qualified 2D thread index. The range of the returned values will extend to the product of the block coordinates limit and the thread coordinates limit.
In CUDA, the grid dimension parameter for a kernel launch (the one you are calling block_size) specifies the grid dimension in terms of number of blocks (in each direction). The block dimension parameter for a kernel launch (the one you are calling thread_size) specifies the size of each block in the grid, in terms of the number of threads per block.
Therefore the total number of threads launched is equal to the product of the grid dimension(s) and the block dimension(s). The overall total is the product of all those quantities, in all dimensions. The total per dimension is the product of the grid dimension in that direction and the block dimension in that direction.
So you have a questionable design choice, in that you have an image size and you are setting the grid dimension equal to the image size. This could only be sensible if you had only 1 thread per block. As you will discover by looking at any proper numba CUDA code (such as the one here) a typical approach is to divide the total desired dimension (in this case, the image size or dimension(s)), by the number of threads per block, to get the grid dimension.
When we do so, the cuda.grid() statement in your kernel code will return a tuple that has sensible ranges. In your case, it would return tuples to threads that correctly go from 0..127 in x, and 0..71 in y. The problem you have at the moment is that the cuda.grid() statement can return tuples that range from 0..((128*32)-1) in x, and that is unnecessary.
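For reference, cuda.grid(2) is just the standard CUDA global-index computation done for you in each dimension. A small sketch in Numba terms (the kernel name and the out array are only for illustration; the cuda.blockIdx/blockDim/threadIdx attributes are the ones Numba exposes):

from numba import cuda

@cuda.jit
def where_am_i(out):
    # what cuda.grid(2) hands back, written out explicitly
    x = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    y = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y
    gx, gy = cuda.grid(2)  # gx == x and gy == y
    if gx < out.shape[0] and gy < out.shape[1]:
        out[gx, gy] = gx * out.shape[1] + gy

With block_size = (128, 72) and thread_size = 32, blockIdx.x runs 0..127 and threadIdx.x runs 0..31, which is why the x component above can reach 128*32 - 1.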
Of course, the goal of your if statement is to prevent out-of-bounds indexing, but the <= test does not look right to me: it is the classic off-by-one error. Threads whose indices happen to equal the limit returned by shape should be excluded.
But the moment I change block_size, weird things start happening and only that much of the original matrix gets filled with ones - so it's only a partial filling.
It's really not clear what your expectations are here. Your kernel design is such that each thread populates (at most) one output point. Therefore sensible grid sizing is to match the grid (the total threads in x, and the total threads in y) to the image dimensions. If you follow the above recommendations for grid sizing calculations, and then set your grid size to something less than your image size, I would expect that portions of your output image would not be populated. Don't do that. Or if you must do that, employ a grid-stride loop kernel design.
Having said all that, the following is how I would rewrite your code:
from numba import cuda
import numpy as np

@cuda.jit
def image_saturate(data):
    pos_x, pos_y = cuda.grid(2)
    if pos_x < data.shape[0] and pos_y < data.shape[1]:
        data[pos_x, pos_y] = 1

if __name__ == "__main__":
    image_x = 128
    image_y = 72
    image_quality = (image_x, image_y)
    image = np.zeros(image_quality)
    thread_x = 32
    thread_y = 1
    thread_size = (thread_x, thread_y)
    block_size = ((image_x//thread_x) + 1, (image_y//thread_y) + 1) # "lazy" round-up
    image_saturate[block_size, thread_size](image)
    print(image)
It appears to run correctly for me. If you now suggest that what you want to do is arbitrarily modify the block_size variable, e.g.:
block_size = (5,5)
and make no other changes, and expect the output image to be fully populated, I would say that is not a sensible expectation. I have no idea how that could be sensible, so I will just say that CUDA doesn't work that way. If you wish to "decouple" the data size from the grid size, the canonical way to do it is the grid-stride loop as already discussed (a minimal sketch follows below).
I've also removed the tuple comparison. I don't think it is really germane here. If you still want to use the tuple comparison, that should work exactly as you would expect based on python. There isn't anything CUDA specific about it.
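For completeness, here is a minimal sketch of the grid-stride loop idea mentioned above (not the original poster's code; it uses cuda.grid(2) for each thread's starting position and cuda.gridsize(2) for the stride, so any launch configuration covers the whole image):

from numba import cuda
import numpy as np

@cuda.jit
def image_saturate_strided(data):
    start_x, start_y = cuda.grid(2)        # this thread's starting coordinates
    stride_x, stride_y = cuda.gridsize(2)  # total threads launched in x and y
    # each thread walks the array with grid-sized steps, so coverage does not
    # depend on the grid matching the data size
    for i in range(start_x, data.shape[0], stride_x):
        for j in range(start_y, data.shape[1], stride_y):
            data[i, j] = 1

if __name__ == "__main__":
    image = np.zeros((128, 72))
    image_saturate_strided[(5, 5), (8, 8)](image)  # deliberately "too small" grid
    print(image.all())  # True: every element still gets written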

I'm late to the game, but I thought an explanation with more visual components might help lend some clarity to the existing answer. It is surprisingly hard to find illustrative answers for how thread indexing works in CUDA. Though the concepts are very easy to hold in your head once you come to understand them, there can be many points of confusion along the way to getting that understanding; hopefully this helps.
Pardon the lack of discussion outside of the comments of the script, but this seems like a case where the context of the code will help avoid miscommunication and help to demonstrate the indexing concepts discussed by others. So I'll leave you with the script, the comments therein, and the compressed output.
To produce uncompressed output, see the first few comments in the thread-main block.
Example script:
from numba import cuda
import numpy as np

@cuda.jit
def image_saturate(data):
    grid_pos_x, grid_pos_y = cuda.grid(2)
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bx = cuda.blockIdx.x
    by = cuda.blockIdx.y
    if (grid_pos_x, grid_pos_y) < data.shape:
        # note that the cuda device array retains the stride order of the original numpy array,
        # so the data is in row-major order (numpy default) and the first index (where we are
        # using grid_pos_x) doesn't actually map to the horizontal axis (aka the typical x-axis),
        # but rather maps to the vertical axis (the typical y-axis).
        #
        # And, as you would then expect, the second axis (where we are using grid_pos_y) maps
        # to the horizontal axis of the array.
        #
        # What you should take away from this observation is that the x,y labels of cuda elements
        # have no explicit connection to the array's memory layout.
        #
        # Therefore, it is up to you, the programmer, to understand the memory layout for your
        # array (whether it's the C-like row-major, or the Fortran-like column-major), and how
        # you should map the (x,y) thread IDs onto the coordinates of your array.
        data[grid_pos_x, grid_pos_y,0,0] = tx
        data[grid_pos_x, grid_pos_y,0,1] = ty
        data[grid_pos_x, grid_pos_y,0,2] = tx + ty
        data[grid_pos_x, grid_pos_y,1,0] = bx
        data[grid_pos_x, grid_pos_y,1,1] = by
        data[grid_pos_x, grid_pos_y,1,2] = bx + by
        data[grid_pos_x, grid_pos_y,2,0] = grid_pos_x
        data[grid_pos_x, grid_pos_y,2,1] = grid_pos_y
        data[grid_pos_x, grid_pos_y,2,2] = grid_pos_x + grid_pos_y

if __name__ == "__main__":
    # uncomment the following line and remove the line after it
    # if you run this code to get more readable results
    # np.set_printoptions(linewidth=500)
    np.set_printoptions(linewidth=500,threshold=3) # compressed output for use on stack overflow
    # image_quality = (128, 72)
    # we are shrinking image_quality to be 23x21 to make the printout easier to read,
    # and intentionally step away from the alignment of threads per block being a
    # multiplicative factor to the image shape. THIS IS BAD for practical applications,
    # it's just helpful for this illustration.
    image_quality = (23,21)
    image = np.zeros(image_quality+(3,3),int)-1
    ## thread_size = 32 # commented to show where the original variable was used
    # below, we rename the variable to be more semantically clear
    # Note: to define the desired thread count in multiple axes, threads_per_block
    #       would have to become a tuple, e.g.:
    #       # defines 32 threads in the thread-block's x axis, and 16 in the y
    #       threads_per_block = 32,16
    threads_per_block = 32
    # threads_per_block = 32 results in an implicit 1 for the y-axis, and an implicit 1 in the z.
    ### Thread blocks are always 3d
    #   this is also true for the thread grid the device will create for your kernel
    ## block_size = image_quality
    # renaming block_size to a semantically more accurate variable name
    # Note: As with threads_per_block, any axis we don't explicitly specify a size
    #       for will be given the default value of 1. So, because image_quality gives 2 values,
    #       for x/y respectively, the z axis will implicitly be given a size of 1.
    block_count = image_quality
    # REMEMBER: The thread/block/grid dimensions we are passing to the kernel launch
    #           are NOT used to infer details about the arguments being passed to the
    #           compiled function (our image array).
    #           It is up to us to write code that appropriately utilizes the arrangement
    #           of the thread blocks the device will build for us once inside the kernel.
    #           SEE THE COMMENT INSIDE THE image_saturate function above.
    image_saturate[block_count, threads_per_block](image)
print(f"{block_count=}; {threads_per_block=}; {image.shape=}")
print("thread id within block; x")
print(image[:,:,0,0])
print("\nthread id within block; y"
"\n-- NOTE 1 regarding all zeros: see comment at the end of printout")
print(image[:,:,0,1])
print("\nsum of x,y thread id within block")
print(image[:,:,0,2])
print("\nblock id within grid; x"
"\n-- NOTE 2 also regarding all zeros: see second comment at the eod of printout")
print(image[:,:,1,0])
print("\nblock id within grid; y")
print(image[:,:,1,1])
print("\nsum of x,y block id within grid")
print(image[:,:,1,2])
print("\nthread unique global x id within full grid; x")
print(image[:,:,2,0])
print("\nthread unique global y id within full grid; y")
print(image[:,:,2,1])
print("\nsum of thread's unique global x,y ids")
print(image[:,:,2,2])
print(f"{'End of 32 threads_per_block output':-<70}")
threads_per_block = 16
# reset the values of image so we can be sure to see if any elements
# of the image go unassigned
image *= 0
image -= 1
# block_count = image_quality # if you wanted to try
print(f"\n\n{block_count=}; {threads_per_block=}; {image.shape=}")
image_saturate[block_count, threads_per_block](image)
print("thread id within block; x")
print(image[:,:,0,0])
print("\nthread id within block; y "
"\n-- again, see NOTE 1")
print(image[:,:,0,1])
print("\nsum of x,y thread id within block")
print(image[:,:,0,2])
print("\nblock id within grid; x "
"\n-- notice that unlike when we had 32 thread_per_block, not all 0")
print(image[:,:,1,0])
print("\nblock id within grid; y")
print(image[:,:,1,1])
print("\nsum of x,y block id within grid")
print(image[:,:,1,2])
print("\nthread unique global x id within full grid; x")
print(image[:,:,2,0])
print("\nthread unique global y id within full grid; y")
print(image[:,:,2,1])
print("\nsum of thread's unique global x,y ids")
print(image[:,:,2,2])
from textwrap import dedent
print(dedent("""
NOTE 1:
The thread IDs recorded for 'thread id within block; y'
are all zero for both versions of `threads_per_block` because we never
specify the number of threads per block that should be created for
the 'y' axis.
So, the compiler defaults to creating only a single thread along those
undefined axis of each block. For that reason, we see that the only
threadID.y value stored is 0 for all i,j elements of the array.
NOTE 2:
**Note 2 mostly pertains to the case where threads_per_block == 32**
The block IDs recorded for 'block id within grid; x' are all zero for
both versions of `threads_per_block` results from similar reasons
mentioned in NOTE 1.
The size of a block, in any axis, is determined by the specified number
of threads for that axis. In this example script, we define threads_per_block
to have an explicit 32 threads in the x axis, leaving the compiler to give an
implicit 1 for both the y and z axis. We then tell the compiler to create 23 blocks
in the x-axis, and 21 blocks in the y; resulting in:
\t* A kernel where the device creates a grid of blocks, 23:21:1 for 483 blocks
\t\t* (x:y:z -> 23:21:1)
\t* Where each block has 32 threads
\t\t* (x:y:z -> 32:1:1)
\t* And our image has height:width of 23:21 for 483 'pixels' in each
\t contrived layer of the image.
As it is hopefully being made clear now, you should see that because each
block has 32 threads on its x-axis, and we have only 23 elements on the corresponding
axis in the image, only 1 of the 23 blocks the device created along the grid's x-axis
will be used. Do note that the overhead of creating those unused blocks is a gross waste
of GPU processor time and could potentially reduce the available resources to the block
that does get used."""))
The output:
block_count=(23, 21); threads_per_block=32; image.shape=(23, 21, 3, 3)
thread id within block; x
[[ 0 0 0 ... 0 0 0]
[ 1 1 1 ... 1 1 1]
[ 2 2 2 ... 2 2 2]
...
[20 20 20 ... 20 20 20]
[21 21 21 ... 21 21 21]
[22 22 22 ... 22 22 22]]
thread id within block; y
-- NOTE 1 regarding all zeros: see comment at the end of printout
[[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
...
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]]
sum of x,y thread id within block
[[ 0 0 0 ... 0 0 0]
[ 1 1 1 ... 1 1 1]
[ 2 2 2 ... 2 2 2]
...
[20 20 20 ... 20 20 20]
[21 21 21 ... 21 21 21]
[22 22 22 ... 22 22 22]]
block id within grid; x
-- NOTE 2 also regarding all zeros: see second comment at the end of printout
[[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
...
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]]
block id within grid; y
[[ 0 1 2 ... 18 19 20]
[ 0 1 2 ... 18 19 20]
[ 0 1 2 ... 18 19 20]
...
[ 0 1 2 ... 18 19 20]
[ 0 1 2 ... 18 19 20]
[ 0 1 2 ... 18 19 20]]
sum of x,y block id within grid
[[ 0 1 2 ... 18 19 20]
[ 0 1 2 ... 18 19 20]
[ 0 1 2 ... 18 19 20]
...
[ 0 1 2 ... 18 19 20]
[ 0 1 2 ... 18 19 20]
[ 0 1 2 ... 18 19 20]]
thread unique global x id within full grid; x
[[ 0 0 0 ... 0 0 0]
[ 1 1 1 ... 1 1 1]
[ 2 2 2 ... 2 2 2]
...
[20 20 20 ... 20 20 20]
[21 21 21 ... 21 21 21]
[22 22 22 ... 22 22 22]]
thread unique global y id within full grid; y
[[ 0 1 2 ... 18 19 20]
[ 0 1 2 ... 18 19 20]
[ 0 1 2 ... 18 19 20]
...
[ 0 1 2 ... 18 19 20]
[ 0 1 2 ... 18 19 20]
[ 0 1 2 ... 18 19 20]]
sum of thread's unique global x,y ids
[[ 0 1 2 ... 18 19 20]
[ 1 2 3 ... 19 20 21]
[ 2 3 4 ... 20 21 22]
...
[20 21 22 ... 38 39 40]
[21 22 23 ... 39 40 41]
[22 23 24 ... 40 41 42]]
End of 32 threads_per_block output------------------------------------
block_count=(23, 21); threads_per_block=16; image.shape=(23, 21, 3, 3)
thread id within block; x
[[0 0 0 ... 0 0 0]
[1 1 1 ... 1 1 1]
[2 2 2 ... 2 2 2]
...
[4 4 4 ... 4 4 4]
[5 5 5 ... 5 5 5]
[6 6 6 ... 6 6 6]]
thread id within block; y
-- again, see NOTE 1
[[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
...
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]]
sum of x,y thread id within block
[[0 0 0 ... 0 0 0]
[1 1 1 ... 1 1 1]
[2 2 2 ... 2 2 2]
...
[4 4 4 ... 4 4 4]
[5 5 5 ... 5 5 5]
[6 6 6 ... 6 6 6]]
block id within grid; x
-- notice that unlike when we had 32 threads_per_block, not all 0
[[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
...
[1 1 1 ... 1 1 1]
[1 1 1 ... 1 1 1]
[1 1 1 ... 1 1 1]]
block id within grid; y
[[ 0 1 2 ... 18 19 20]
[ 0 1 2 ... 18 19 20]
[ 0 1 2 ... 18 19 20]
...
[ 0 1 2 ... 18 19 20]
[ 0 1 2 ... 18 19 20]
[ 0 1 2 ... 18 19 20]]
sum of x,y block id within grid
[[ 0 1 2 ... 18 19 20]
[ 0 1 2 ... 18 19 20]
[ 0 1 2 ... 18 19 20]
...
[ 1 2 3 ... 19 20 21]
[ 1 2 3 ... 19 20 21]
[ 1 2 3 ... 19 20 21]]
thread unique global x id within full grid; x
[[ 0 0 0 ... 0 0 0]
[ 1 1 1 ... 1 1 1]
[ 2 2 2 ... 2 2 2]
...
[20 20 20 ... 20 20 20]
[21 21 21 ... 21 21 21]
[22 22 22 ... 22 22 22]]
thread unique global y id within full grid; y
[[ 0 1 2 ... 18 19 20]
[ 0 1 2 ... 18 19 20]
[ 0 1 2 ... 18 19 20]
...
[ 0 1 2 ... 18 19 20]
[ 0 1 2 ... 18 19 20]
[ 0 1 2 ... 18 19 20]]
sum of thread's unique global x,y ids
[[ 0 1 2 ... 18 19 20]
[ 1 2 3 ... 19 20 21]
[ 2 3 4 ... 20 21 22]
...
[20 21 22 ... 38 39 40]
[21 22 23 ... 39 40 41]
[22 23 24 ... 40 41 42]]
NOTE 1:
The thread IDs recorded for 'thread id within block; y'
are all zero for both versions of `threads_per_block` because we never
specify the number of threads per block that should be created for
the 'y' axis.
So, the compiler defaults to creating only a single thread along those
undefined axis of each block. For that reason, we see that the only
threadID.y value stored is 0 for all i,j elements of the array.
NOTE 2:
**Note 2 mostly pertains to the case where threads_per_block == 32
is greater than the number of elements in the corresponding axis of the image**
The block.x IDs recorded for 'block id within grid; x' are all zero for
the `32` version of `threads_per_block` due to the relative difference in size between
the specified number of threads per block and the number of elements in the
image along the corresponding axis.
The size of a block, in any axis, is determined by the specified number
of threads for that axis. In this example script, we define threads_per_block
to have an explicit 32 threads in the x axis, leaving the compiler to give an
implicit 1 for both the y and z axis. We then tell the compiler to create 23 blocks
in the x-axis, and 21 blocks in the y; resulting in:
* A kernel where the device creates a grid of blocks, 23:21:1 for 483 blocks
* (x:y:z -> 23:21:1)
* Where each block has 32 threads
* (x:y:z -> 32:1:1)
* And our image has height:width of 23:21 for 483 'pixels' in each
contrived layer of the image.
As it is hopefully being made clear now, you should see that because each
block has 32 threads on its x-axis, and we have only 23 elements on the corresponding
axis in the image, only 1 of the 23 blocks the device created along the grid's x-axis
will be used. Do note that the overhead of creating those unused blocks is a gross waste
of GPU processor time and could potentially reduce the available resources to the block
that does get used.

Related

Find the shortest path fast

I want to find the shortest path between many points.
I generate an 8x8 matrix, with random values like:
[[ 0 31 33 0 43 10 0 0]
[31 0 30 0 0 13 0 0]
[33 30 0 11 12 5 6 0]
[ 0 0 11 0 15 0 38 11]
[43 0 12 15 0 39 0 0]
[10 13 5 0 39 0 3 49]
[ 0 0 6 38 0 3 0 35]
[ 0 0 0 11 0 49 35 0]]
Now I want to take the first row and find the smallest number, then see where it is in the row and take its position. Next I clear the first row to forget the first point, and put the next position in a new list holding the path. Then it does the same for the new point. Finally, when all points are in my path list, it shows me the shortest way.
indm=0
lenm=[]
prochain=max(matrixF[indm])
chemin=[]
long=len(chemin)
while long != n:
    for i in range(n):
        if matrixF[indm,i] <= prochain and matrixF[indm,i]!=0:
            pluspetit=matrixF[indm,i]
    prochainpoint=np.where(matrixF == pluspetit)
    chemin.append(prochainpoint)
    indm=prochainpoint
    for i in range(n):
        matrixF[indm,i]=0
    long=len(chemin)
print(chemin)
print(matrixF)
But I got this error:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
This line computes all the indices where matrixF == pluspetit:
prochainpoint=np.where(matrixF == pluspetit)
Problem is, there's more than one, so on your first pass through, prochainpoint ends up as (array([0, 5]), array([5, 0])). You then set indm to prochainpoint, so on your next pass, instead of getting a single value with matrixF[indm,i], you retrieve a 2x2 array of (repeated) values, as if you'd done:
np.array([[matrixF[0,1], matrixF[5,1]],
          [matrixF[5,1], matrixF[0,1]]])
Comparing this to prochain (still a scalar value) produces a 2x2 array of boolean results, which you then try to test for truthiness, but numpy doesn't want to guess at whether you mean "are they all true?" or "are any of them true?", so it dumps that decision back on you with the error message.
I'm assuming the problem is with prochainpoint=np.where(matrixF == pluspetit), where you get many results when you presumably only want one, but I'm not clear what the real intent of the line is, so you'll have to figure out what you really intended to do there and replace it with something that consistently computes a single value.
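If the intent is a greedy nearest-neighbour walk (always moving to the closest not-yet-visited point, where a zero entry means "no edge / already visited"), one possible sketch that always produces a single index per step is below. This is only an illustration of how to get one value out of the current row, not necessarily the algorithm you meant:

import numpy as np

def greedy_path(matrixF, start=0):
    # walk the matrix greedily, always taking the smallest nonzero entry of the current row
    m = matrixF.astype(float)
    n = m.shape[0]
    chemin = [start]
    indm = start
    for _ in range(n - 1):
        row = m[indm].copy()
        row[row == 0] = np.inf               # ignore zeros when searching for the minimum
        prochainpoint = int(np.argmin(row))  # a single index, not a tuple of arrays
        if not np.isfinite(row[prochainpoint]):
            break                            # no reachable unvisited point left
        chemin.append(prochainpoint)
        m[:, indm] = 0                       # "clear" the visited point
        m[indm, :] = 0
        indm = prochainpoint
    return chemin

# usage: chemin = greedy_path(matrixF)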

I'm trying to get all rows with an Index 0 from one array into another one with for loop and np.concatenate

I'm trying to get all rows with an Index 0 from one array into another one with for loop and np.concatenate
i=0
data0 = np.zeros((1,257))
data0.shape = (257,)
for j in range (0,7291):
    if datatrain[j,i] == 0:
        data0 = np.concatenate((data0, datatrain[j,:]))
My problem is that data0 is rebuilt from scratch on every pass through the loop; are there better approaches for this?
You don't need a loop at all:
col = 0
indices = np.where(datatrain[:, col] == 0)[0]
zero_col = np.zeros_like(indices).reshape(-1, 1)
data_of_interest = np.concatenate((zero_col, datatrain[indices, :]), axis=1)
Since I don't have a sample of your dataset, I can't test it for your specific situation.
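That said, if the goal is simply "all rows whose column i equals 0", plain boolean indexing does it in one step; a minimal sketch (the tiny datatrain below is made up just so the snippet runs on its own):

import numpy as np

datatrain = np.arange(20).reshape(4, 5)   # toy stand-in for the real data
datatrain[2, 0] = 0                       # give a second row a leading zero
data0 = datatrain[datatrain[:, 0] == 0]   # keep rows whose first column is 0
print(data0)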
Do you want to just get all rows with 0 in them? You can do that like this:
import numpy as np
datatrain = np.arange(25).reshape(5, 5)
datatrain[0][1] = 0 # 1st row now has two 0s (arange starts at 0)
datatrain[1][2] = 0 # 2nd row now has a 0
datatrain[-1][4] = 0 # last row now has a 0
print(datatrain)
# Outputs:
# [[ 0 0 2 3 4]
# [ 5 6 0 8 9]
# [10 11 12 13 14]
# [15 16 17 18 19]
# [20 21 22 23 0]]
rows_inds_with_zeros, cols_with_zeros = np.where(datatrain == 0)
print(rows_inds_with_zeros)
# Outputs: [0 0 1 4] (as expected, note 0th row included twice)
# You probably don't want the row twice if it has two 0s,
# although that's what your code does, hence np.unique
rows_with_zeros = datatrain[np.unique(rows_inds_with_zeros)]
print(rows_with_zeros) # Or call it data0, whatever you like
# Outputs:
# [[ 0 0 2 3 4]
# [ 5 6 0 8 9]
# [20 21 22 23 0]]
HTH.

input a non-regular matrix in python

link: https://cw.felk.cvut.cz/courses/a4b33alg/task.php?task=pary_py&idu=2341
I want to input the matrix split by space by using:
def neighbour_pair(l):
    matrix = [[int(row) for row in input().split()] for i in range(l)]
but the program told me
TypeError: 'str' object cannot be interpreted as an integer
It seems the .split() didn't work but I don't know why.
here is an example of the input matrix:
13 5
7 50 0 0 1
2 70 10 11 0
4 30 9 0 0
6 70 0 0 0
1 90 8 12 0
9 90 0 2 1
13 90 0 6 0
5 30 4 3 0
12 80 0 0 1
10 50 0 0 1
11 50 0 0 0
3 80 1 13 0
8 70 7 0 1
The input is a binary tree with N nodes, the nodes are labeled by numbers 1 to N in random order, each label is unique. Each node contains an integer key in the range from 0 to (2^31)−1.
The first line of input contains two integers N and R separated by space. N is the number of nodes in the tree, R is the label of the tree root.
Next, there are N lines. Each line describes one node and the order of the nodes is arbitrary. A node is specified by five integer values. The first value is the node label, the second value is the node key, the third and the fourth values represent the labels of the left and right child respectively, and the fifth value represents the node color, white is 0, black is 1. If any of the children does not exist there is value 0 instead of the child label at the corresponding place. The values on the line are separated by a space.
This is range() complaining that your l variable is a string:
>>> range('1')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'str' object cannot be interpreted as an integer
I suspect you are reading l from standard input as well; cast it to an integer:
l = int(input())
matrix = [[int(row) for row in input().split()] for i in range(l)]
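For the full input format described in the question (a first line containing N and R, followed by N node lines), a sketch of the complete parse could look like this (read_tree is a hypothetical helper name, not something from the original code):

def read_tree():
    # first line: number of nodes N and the label of the root R
    n, r = map(int, input().split())
    # next N lines: label, key, left child label, right child label, color (0 means no child)
    nodes = [[int(v) for v in input().split()] for _ in range(n)]
    return r, nodes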
I agree with @alecxe. It seems that your error comes from the string being used as l in your range(l) call. If I cast it to an int inside range() it works: entering 3 followed by three rows of input gives me the output below.
>>> l = input() # define the number of rows expected in the input matrix
3
>>> [[int(row) for row in input().split()] for i in range(int(l))]
13 5
7 50 0 0 1
2 70 10 11 0
output
[[13, 5], [7, 50, 0, 0, 1], [2, 70, 10, 11, 0]]
Implemented as a method, per the OP request in the comments below:
def neighbour_pair():
    l = input()
    return [[int(row) for row in input().split()] for i in range(int(l))]

print( neighbour_pair() )
# input
# 3
# 13 5
# 7 50 0 0 1
# 2 70 10 11 0
# output
[[13, 5], [7, 50, 0, 0, 1], [2, 70, 10, 11, 0]]
Still nothing wrong with this implementation...

replace zeroes in numpy array with the median value

I have a numpy array like this:
foo_array = [38,26,14,55,31,0,15,8,0,0,0,18,40,27,3,19,0,49,29,21,5,38,29,17,16]
I want to replace all the zeros with the median value of the whole array (where the zero values are not to be included in the calculation of the median)
So far I have this going on:
foo_array = [38,26,14,55,31,0,15,8,0,0,0,18,40,27,3,19,0,49,29,21,5,38,29,17,16]
foo = np.array(foo_array)
foo = np.sort(foo)
print "foo sorted:",foo
#foo sorted: [ 0 0 0 0 0 3 5 8 14 15 16 17 18 19 21 26 27 29 29 31 38 38 40 49 55]
nonzero_values = foo[0::] > 0
nz_values = foo[nonzero_values]
print "nonzero_values?:",nz_values
#nonzero_values?: [ 3 5 8 14 15 16 17 18 19 21 26 27 29 29 31 38 38 40 49 55]
size = np.size(nz_values)
middle = size / 2
print "median is:",nz_values[middle]
#median is: 26
Is there a clever way to achieve this with numpy syntax?
Thank you
This solution takes advantage of numpy.median:
import numpy as np
foo_array = [38,26,14,55,31,0,15,8,0,0,0,18,40,27,3,19,0,49,29,21,5,38,29,17,16]
foo = np.array(foo_array)
# Compute the median of the non-zero elements
m = np.median(foo[foo > 0])
# Assign the median to the zero elements
foo[foo == 0] = m
Just a note of caution: the median of your array (excluding the zeroes) is 23.5, but as written this sticks in 23, because foo is an integer array.
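If keeping the fractional median matters, a small variation (just a sketch building on the answer above) is to make the array floating point before assigning:

import numpy as np

foo = np.array([38,26,14,55,31,0,15,8,0,0,0,18,40,27,3,19,0,49,29,21,5,38,29,17,16], dtype=float)
m = np.median(foo[foo > 0])  # 23.5, no longer truncated on assignment
foo[foo == 0] = m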
foo2 = foo[:]
foo2[foo2 == 0] = nz_values[middle]
Instead of foo2, you could just update foo if you want. Numpy's smart array syntax can combine a few lines of the code you wrote. For example, instead of:
nonzero_values = foo[0::] > 0
nz_values = foo[nonzero_values]
You can just do
nz_values = foo[foo > 0]
You can find out more about "fancy indexing" in the documentation.

Numpy, problem with long arrays

I have two arrays (a and b) with n integer elements in the range (0,N).
typo: arrays with 2^n integers where the largest integer takes the value N = 3^n
I want to calculate the sum of every combination of elements in a and b (sum_ij = a_i + b_j for all i,j). Then take modulus N (sum_ij = sum_ij % N), and finally calculate the frequency of the different sums.
In order to do this fast with numpy, without any loops, I tried to use the meshgrid and the bincount function.
A,B = numpy.meshgrid(a,b)
A = A + B
A = A % N
A = numpy.reshape(A,A.size)
result = numpy.bincount(A)
Now, the problem is that my input arrays are long. And meshgrid gives me MemoryError when I use inputs with 2^13 elements. I would like to calculate this for arrays with 2^15-2^20 elements.
that is n in the range 15 to 20
Are there any clever tricks to do this with numpy?
Any help will be highly appreciated.
--
jon
Try chunking it. Your meshgrid is an NxN matrix; block that up into 10x10 chunks of size N/10 x N/10, compute the bin counts for each of the 100 chunks, and add them up at the end. This uses only ~1% as much memory as doing the whole thing at once.
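A sketch of that chunking idea, accumulating partial bincounts over slices of a so the full pairwise mesh is never held in memory (the chunk size and the small self-check below are arbitrary choices of mine, not from the original answer):

import numpy as np

def chunked_bincount(a, b, N, chunk=1024):
    # frequency of (a[i] + b[j]) % N over all pairs, one slab of rows at a time
    counts = np.zeros(N, dtype=np.int64)
    for start in range(0, len(a), chunk):
        block = (a[start:start + chunk, np.newaxis] + b[np.newaxis, :]) % N
        counts += np.bincount(block.ravel(), minlength=N)
    return counts

# self-check on a size where the direct (memory-hungry) version still fits
rng = np.random.default_rng(0)
N = 81
a = rng.integers(0, N, size=2000)
b = rng.integers(0, N, size=2000)
direct = np.bincount(((a[:, np.newaxis] + b) % N).ravel(), minlength=N)
assert np.array_equal(chunked_bincount(a, b, N), direct)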
Edit in response to jonalm's comment:
jonalm: N~3^n not n~3^N. N is max element in a and n is number of
elements in a.
n is ~ 2^20. If N is ~ 3^n then N is ~ 3^(2^20) > 10^(500207).
Scientists estimate (http://www.stormloader.com/ajy/reallife.html) that there are only around 10^87 particles in the universe. So there is no (naive) way a computer can handle an int of size 10^(500207).
jonalm: I am however a bit curious about the pv() function you define. (I
do not manage to run it, as text.find() is not defined (I guess it's in another
module)). How does this function work and what is its advantage?
pv is a little helper function I wrote to debug the value of variables. It works like
print() except when you say pv(x) it prints both the literal variable name (or expression string), a colon, and then the variable's value.
If you put
#!/usr/bin/env python
import traceback

def pv(var):
    (filename,line_number,function_name,text)=traceback.extract_stack()[-2]
    print('%s: %s'%(text[text.find('(')+1:-1],var))

x=1
pv(x)
in a script you should get
x: 1
The modest advantage of using pv over print is that it saves you typing. Instead of having to
write
print('x: %s'%x)
you can just slap down
pv(x)
When there are multiple variables to track, it's helpful to label the variables.
I just got tired of writing it all out.
The pv function works by using the traceback module to peek at the line of code
used to call the pv function itself. (See http://docs.python.org/library/traceback.html#module-traceback) That line of code is stored as a string in the variable text.
text.find() is a call to the usual string method find(). For instance, if
text='pv(x)'
then
text.find('(') == 2 # The index of the '(' in string text
text[text.find('(')+1:-1] == 'x' # Everything in between the parentheses
I'm assuming n ~ 3^N, and n~2**20
The idea is to work modulo N. This cuts down on the size of the arrays.
The second idea (important when n is huge) is to use numpy ndarrays of 'object' type because if you use an integer dtype you run the risk of overflowing the size of the maximum integer allowed.
#!/usr/bin/env python
import traceback
import numpy as np

def pv(var):
    (filename,line_number,function_name,text)=traceback.extract_stack()[-2]
    print('%s: %s'%(text[text.find('(')+1:-1],var))
You can change n to be 2**20, but below I show what happens with small n
so the output is easier to read.
n=100
N=int(np.exp(1./3*np.log(n)))
pv(N)
# N: 4
a=np.random.randint(N,size=n)
b=np.random.randint(N,size=n)
pv(a)
pv(b)
# a: [1 0 3 0 1 0 1 2 0 2 1 3 1 0 1 2 2 0 2 3 3 3 1 0 1 1 2 0 1 2 3 1 2 1 0 0 3
# 1 3 2 3 2 1 1 2 2 0 3 0 2 0 0 2 2 1 3 0 2 1 0 2 3 1 0 1 1 0 1 3 0 2 2 0 2
# 0 2 3 0 2 0 1 1 3 2 2 3 2 0 3 1 1 1 1 2 3 3 2 2 3 1]
# b: [1 3 2 1 1 2 1 1 1 3 0 3 0 2 2 3 2 0 1 3 1 0 0 3 3 2 1 1 2 0 1 2 0 3 3 1 0
# 3 3 3 1 1 3 3 3 1 1 0 2 1 0 0 3 0 2 1 0 2 2 0 0 0 1 1 3 1 1 1 2 1 1 3 2 3
# 3 1 2 1 0 0 2 3 1 0 2 1 1 1 1 3 3 0 2 2 3 2 0 1 3 1]
wa holds the number of 0s, 1s, 2s, 3s in a
wb holds the number of 0s, 1s, 2s, 3s in b
wa=np.bincount(a)
wb=np.bincount(b)
pv(wa)
pv(wb)
# wa: [24 28 28 20]
# wb: [21 34 20 25]
result=np.zeros(N,dtype='object')
Think of a 0 as a token or chip. Similarly for 1,2,3.
Think of wa=[24 28 28 20] as meaning there is a bag with 24 0-chips, 28 1-chips, 28 2-chips, 20 3-chips.
You have a wa-bag and a wb-bag. When you draw a chip from each bag, you "add" them together and form a new chip. You "mod" the answer (modulo N).
Imagine taking a 1-chip from the wb-bag and adding it with each chip in the wa-bag.
1-chip + 0-chip = 1-chip
1-chip + 1-chip = 2-chip
1-chip + 2-chip = 3-chip
1-chip + 3-chip = 4-chip = 0-chip (we are mod'ing by N=4)
Since there are 34 1-chips in the wb bag, when you add them against all the chips in the wa=[24 28 28 20] bag, you get
34*24 1-chips
34*28 2-chips
34*28 3-chips
34*20 0-chips
This is just the partial count due to the 34 1-chips. You also have to handle the other
types of chips in the wb-bag, but this shows you the method used below:
for i,count in enumerate(wb):
    partial_count=count*wa
    pv(partial_count)
    shifted_partial_count=np.roll(partial_count,i)
    pv(shifted_partial_count)
    result+=shifted_partial_count
# partial_count: [504 588 588 420]
# shifted_partial_count: [504 588 588 420]
# partial_count: [816 952 952 680]
# shifted_partial_count: [680 816 952 952]
# partial_count: [480 560 560 400]
# shifted_partial_count: [560 400 480 560]
# partial_count: [600 700 700 500]
# shifted_partial_count: [700 700 500 600]
pv(result)
# result: [2444 2504 2520 2532]
This is the final result: 2444 0s, 2504 1s, 2520 2s, 2532 3s.
# This is a test to make sure the result is correct.
# This uses a very memory intensive method.
# c is too huge when n is large.
if n>1000:
    print('n is too large to run the check')
else:
    c=(a[:]+b[:,np.newaxis])
    c=c.ravel()
    c=c%N
    result2=np.bincount(c)
    pv(result2)
    assert(all(r1==r2 for r1,r2 in zip(result,result2)))
# result2: [2444 2504 2520 2532]
Check your math, that's a lot of space you're asking for:
2^20*2^20 = 2^40 = 1 099 511 627 776
If each of your elements was just one byte, that's already one terabyte of memory.
Add a loop or two. This problem is not suited to maxing out your memory and minimizing your computation.
