I am given a PyTorch 2-D tensor of integers, and 2 integers that always appear in each row of the tensor.
I want to create a binary mask that will contain 1 between the two appearances of these 2 integers, otherwise 0. For example, if the integers are 4 and 2 and the 1-D array is [1,1,9,4,6,5,1,2,9,9,11,4,3,6,5,2], the returned mask will be: [0,0,0,1,1,1,1,1,0,0,0,1,1,1,1,1].
Is there any efficient and quick way to compute this mask without iterations?
Perhaps a bit messy, but it works without iterations. In the following I assume an example tensor m to which I apply the solution; it's easier to explain with a concrete example than with general notation.
import torch
vals = [2, 8]  # let's assume those are the constant values that appear in each row
# target tensor
m = torch.tensor([[1., 2., 7., 8., 5.],
                  [4., 7., 2., 1., 8.]])
# let's find the indexes of those values
k = m == vals[0]
p = m == vals[1]
v = (k.int() + p.int()).bool()
nz_indexes = v.nonzero()[:, 1].reshape(m.shape[0], 2)
# let's create a tiling of the indexes
q = torch.arange(m.shape[1])
q = q.repeat(m.shape[0], 1)
# you only need two masks, no matter the size of m; see explanation below
msk_0 = (nz_indexes[:, 0].repeat(m.shape[1], 1).transpose(0, 1)) <= q
msk_1 = (nz_indexes[:, 1].repeat(m.shape[1], 1).transpose(0, 1)) >= q
final_mask = msk_0.int() * msk_1.int()
print(final_mask)
and we get
tensor([[0, 1, 1, 1, 0],
        [0, 0, 1, 1, 1]], dtype=torch.int32)
Regarding the two masks msk_0 and msk_1, in case it's not clear what they are: note that nz_indexes[:,0] contains, for each row of m, the column index at which vals[0] is found, and nz_indexes[:,1] similarly contains, for each row of m, the column index at which vals[1] is found.
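As an aside, the same two masks can be built with broadcasting instead of repeat/transpose; a minimal equivalent sketch under the same assumptions:

q = torch.arange(m.shape[1])  # shape (cols,)
# nz_indexes[:, :1] and nz_indexes[:, 1:] keep a trailing dimension of 1,
# so each comparison broadcasts to shape (rows, cols)
final_mask = ((q >= nz_indexes[:, :1]) & (q <= nz_indexes[:, 1:])).int()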
Based completely on the previous solution, here is a revised one that handles two pairs of the values per row:
import torch
vals = [2, 8]  # let's assume those are the constant values that appear in each row
# target tensor
m = torch.tensor([[1., 2., 7., 8., 5., 2., 6., 5., 8., 4.],
                  [4., 7., 2., 1., 8., 2., 6., 5., 6., 8.]])
# let's find the indexes of those values
k = m == vals[0]
p = m == vals[1]
v = (k.int() + p.int()).bool()
nz_indexes = v.nonzero()[:, 1].reshape(m.shape[0], 4)
# let's create a tiling of the indexes
q = torch.arange(m.shape[1])
q = q.repeat(m.shape[0], 1)
# this time we need four masks, one pair of masks per pair of values
msk_0 = (nz_indexes[:, 0].repeat(m.shape[1], 1).transpose(0, 1)) <= q
msk_1 = (nz_indexes[:, 1].repeat(m.shape[1], 1).transpose(0, 1)) >= q
msk_2 = (nz_indexes[:, 2].repeat(m.shape[1], 1).transpose(0, 1)) <= q
msk_3 = (nz_indexes[:, 3].repeat(m.shape[1], 1).transpose(0, 1)) >= q
final_mask = msk_0.int() * msk_1.int() + msk_2.int() * msk_3.int()
print(final_mask)
and we finally get
tensor([[0, 1, 1, 1, 0, 1, 1, 1, 1, 0],
        [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=torch.int32)
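For what it's worth, this also generalizes to any even number of marker occurrences per row without writing masks by hand. A sketch (not part of the original answer), assuming every row contains the same number of value pairs and the markers alternate open/close in column order:

import torch

vals = [2, 8]
m = torch.tensor([[1., 2., 7., 8., 5., 2., 6., 5., 8., 4.],
                  [4., 7., 2., 1., 8., 2., 6., 5., 6., 8.]])

hits = (m == vals[0]) | (m == vals[1])
# group the sorted column indexes into (row, pair, [open, close])
pairs = hits.nonzero()[:, 1].reshape(m.shape[0], -1, 2)
q = torch.arange(m.shape[1])
# broadcast to (rows, n_pairs, cols): True where column c of row i
# falls between the two markers of pair j
inside = (q >= pairs[..., :1]) & (q <= pairs[..., 1:])
final_mask = inside.any(dim=1).int()
print(final_mask)  # same output as above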
The title may be a little obscure, sorry. I'm not quite sure how to phrase it otherwise.
The problem is as such: I have a 2D kernel (as is used for convolution) and a vector of weights (e.g., floats or ints).
I would like to produce a 2D array of the kernel multiplied by the weights, but stacked by the position of the weights. For example:
import numpy as np

kernel = np.arange(6).reshape((2, 3))
nrows, kernel_length = kernel.shape
kernel_midpt = kernel_length // 2
weights = np.array([0, 1, 2, 3])

expected_output = np.zeros((nrows, len(weights) + 2*kernel_midpt))
for ii in range(len(weights)):
    expected_output[:, ii:ii+kernel_length] += weights[ii] * kernel
expected_output
> array([[ 0.,  0.,  1.,  4.,  7.,  6.],
>        [ 0.,  3., 10., 22., 22., 15.]])
So basically, what this code is doing is taking the kernel
array([[0, 1, 2],
       [3, 4, 5]])
and multiplying it by 0, and adding the result to the first three columns (0, 1, 2) of the output. Next, it multiplies the kernel by 1 and adds the result to columns (1, 2, 3), etc.
I obviously have the code to do it with a for loop, but for computational purposes it is very expensive. Is there a simpler way to do this? I have considered einsum, but haven't figured out how to use it.
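One observation that may help (not from the original post): the loop computes, for each kernel row, a full 1-D convolution of weights with that row, so a single 2-D full convolution against the weights laid out as one row reproduces it. A minimal sketch using scipy.signal.convolve2d:

import numpy as np
from scipy.signal import convolve2d

kernel = np.arange(6).reshape((2, 3))
weights = np.array([0, 1, 2, 3])

# full 2-D convolution with weights as a single row; each output row
# equals np.convolve(weights, kernel_row, mode='full')
output = convolve2d(kernel, weights[None, :], mode='full')
print(output)
# [[ 0  0  1  4  7  6]
#  [ 0  3 10 22 22 15]]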
I am working with a thematic raster of land use classes. The goal is to split the raster into smaller tiles of a given size. For example, I have a raster 1490 pixels wide and I want to split it into tiles of 250x250 pixels. To get tiles of equal size, I would want to increase the width of the raster to 1500 pixels so that exactly 6 tiles fit. To do so, I need to increase the size of the raster by 10 pixels.
I am currently opening the raster with the rasterio library, which returns a NumPy ndarray. Is there a function to add a buffer around this array? The goal would be something like this:
import numpy as np
a = np.array([
    [1,4,5],
    [4,5,5],
    [1,2,2]
])
a_with_buffer = a.buffer(a, 1)  # 2nd argument refers to the buffer size
Then a_with_buffer would look as follows:
[[0,0,0,0,0],
 [0,1,4,5,0],
 [0,4,5,5,0],
 [0,1,2,2,0],
 [0,0,0,0,0]]
You can use np.pad:
>>> np.pad(a, 1)
array([[0, 0, 0, 0, 0],
       [0, 1, 4, 5, 0],
       [0, 4, 5, 5, 0],
       [0, 1, 2, 2, 0],
       [0, 0, 0, 0, 0]])
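For the raster use case in the question, np.pad also accepts per-side widths, so you can grow only one edge, e.g. from 1490 to 1500 columns. A small sketch with a hypothetical raster array:

>>> raster = np.zeros((250, 1490))  # hypothetical raster, 1490 pixels wide
>>> # pad_width is ((rows_before, rows_after), (cols_before, cols_after))
>>> np.pad(raster, ((0, 0), (0, 10))).shape
(250, 1500)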
You can create an array of zeros with np.zeros and then insert a at the indices you want, like below.
Try this:
>>> a = np.array([[1,4,5],[4,5,5],[1,2,2]])
>>> b = np.zeros((5,5))
>>> b[1:1+a.shape[0], 1:1+a.shape[1]] = a
>>> b
array([[0., 0., 0., 0., 0.],
       [0., 1., 4., 5., 0.],
       [0., 4., 5., 5., 0.],
       [0., 1., 2., 2., 0.],
       [0., 0., 0., 0., 0.]])
Given the following numpy arrays:
import numpy
a=numpy.array([[1,1,1],[1,1,1],[1,1,1]])
b=numpy.array([[2,2,2],[2,2,2],[2,2,2]])
c=numpy.array([[3,3,3],[3,3,3],[3,3,3]])
and this dictionary containing them all:
mydict={0:a,1:b,2:c}
What is the most efficient way of iterating through mydict so as to compute the average numpy array, i.e. the one that has (1+2+3)/3=2 as its values?
My attempt fails as I am giving it too many values to unpack. It is also extremely inefficient as it has an O(n^3) time complexity:
aver = numpy.empty([a.shape[0], a.shape[1]])
for c, v in mydict.values():
    for i in range(0, a.shape[0]):
        for j in range(0, a.shape[1]):
            aver[i][j] = mydict[c][i][j]  # <- too many values to unpack
The final result should be:
In[17]: aver
Out[17]:
array([[ 2.,  2.,  2.],
       [ 2.,  2.,  2.],
       [ 2.,  2.,  2.]])
EDIT
I am not looking for an average value for each numpy array. I am looking for an average value for each element of my collection of numpy arrays. This is a minimal example, but the real thing I am working on has over 120,000 elements per array, and for the same position the values change from array to array.
I think you're making this harder than it needs to be. Either sum them and divide by the number of terms:
In [42]: v = mydict.values()
In [43]: sum(v) / len(v)
Out[43]:
array([[ 2.,  2.,  2.],
       [ 2.,  2.,  2.],
       [ 2.,  2.,  2.]])
Or stack them into one big array -- which it sounds like is the format they probably should have been in to start with -- and take the mean over the stacked axis:
In [44]: np.array(list(v)).mean(axis=0)
Out[44]:
array([[ 2.,  2.,  2.],
       [ 2.,  2.,  2.],
       [ 2.,  2.,  2.]])
You really shouldn't be using a dict of numpy.arrays. Just use a multi-dimensional array:
>>> bigarray = numpy.array([arr.tolist() for arr in mydict.values()])
>>> bigarray
array([[[1, 1, 1],
        [1, 1, 1],
        [1, 1, 1]],

       [[2, 2, 2],
        [2, 2, 2],
        [2, 2, 2]],

       [[3, 3, 3],
        [3, 3, 3],
        [3, 3, 3]]])
>>> bigarray.mean(axis=0)
array([[ 2.,  2.,  2.],
       [ 2.,  2.,  2.],
       [ 2.,  2.,  2.]])
You should modify your code to not work with a dict at all, especially not a dict with integer keys...
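If the dict is unavoidable, a compact variant (a sketch on the same toy data) stacks the values without the tolist() round-trip:

>>> import numpy as np
>>> aver = np.stack(list(mydict.values())).mean(axis=0)
>>> aver
array([[ 2.,  2.,  2.],
       [ 2.,  2.,  2.],
       [ 2.,  2.,  2.]])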
I have a numpy array, a list of start/end indexes that define ranges within the array, and a list of values, where the number of values is the same as the number of ranges. Doing this assignment in a loop is currently very slow, so I'd like to assign the values to the corresponding ranges in the array in a vectorized way. Is this possible to do?
Here's a concrete, simplified example:
a = np.zeros([10])
Here's a list of start indexes and a list of end indexes that define ranges within a:
starts = [0, 2, 4, 6]
ends = [2, 4, 6, 8]
And here's a list of values I'd like to assign to each range:
values = [1, 2, 3, 4]
I have two problems. The first is that I can't figure out how to index into the array using multiple slices at the same time, since the list of ranges is constructed dynamically in the actual code. Once I'm able to extract the ranges, I'm not sure how to assign multiple values at once - one value per range.
Here's how I've tried creating a list of slices and the problems I've run into when using that list to index into the array:
slices = [slice(start, end) for start, end in zip(starts, ends)]
In [97]: a[slices]
...
IndexError: too many indices for array
In [98]: a[np.r_[slices]]
...
IndexError: arrays used as indices must be of integer (or boolean) type
If I use a static list, I can extract multiple slices at once, but then assignment doesn't work the way I want:
In [106]: a[np.r_[0:2, 2:4, 4:6, 6:8]] = [1, 2, 3]
/usr/local/bin/ipython:1: DeprecationWarning: assignment will raise an error in the future, most likely because your index result shape does not match the value array shape. You can use `arr.flat[index] = values` to keep the old behaviour.
#!/usr/local/opt/python/bin/python2.7
In [107]: a
Out[107]: array([ 1., 2., 3., 1., 2., 3., 1., 2., 0., 0.])
What I actually want is this:
np.array([1., 1., 2., 2., 3., 3., 4., 4., 0., 0.])
This will do the trick in a fully vectorized manner:
counts = np.asarray(ends) - np.asarray(starts)
# relative offsets 0..count-1 within each range: start from ones and
# jump back by the range length at each range boundary
idx = np.ones(counts.sum(), dtype=int)
idx[np.cumsum(counts)[:-1]] -= counts[:-1]
idx = np.cumsum(idx) - 1 + np.repeat(starts, counts)
a[idx] = np.repeat(values, counts)
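With the example arrays from the question this produces idx = [0, 1, 2, 3, 4, 5, 6, 7] and fills a as expected:

>>> a
array([ 1.,  1.,  2.,  2.,  3.,  3.,  4.,  4.,  0.,  0.])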
One possibility is to zip the start and end indexes with the values and expand the indexes and values manually:
import numpy as np

starts = [0, 2, 4, 6]
ends = [2, 4, 6, 8]
values = [1, 2, 3, 4]
a = np.zeros(10)

# calculate the index array and value array by zipping the starts, ends and
# values and expanding them (this assumes all ranges are the same length,
# otherwise np.array(idx) cannot form a rectangular array)
idx, val = zip(*[(list(range(s, e)), [v] * (e-s)) for s, e, v in zip(starts, ends, values)])
# assign values
a[np.array(idx).flatten()] = np.array(val).flatten()
a
# array([ 1., 1., 2., 2., 3., 3., 4., 4., 0., 0.])
Or write a for loop to assign values one range by another:
for s, e, v in zip(starts, ends, values):
    a[s:e] = v
a
# array([ 1., 1., 2., 2., 3., 3., 4., 4., 0., 0.])
I am having issues understanding how X and y are referenced for training.
I have a simple csv file with 5 numeric columns that I am loading into a NumPy array as follows:
url = "http://www.xyz/shortDataFinal.data"
# download the file
raw_data = urllib.urlopen(url)
# load the CSV file as a numpy matrix
dataset = np.loadtxt(raw_data, delimiter=",")
print(dataset.shape)
# separate the data from the target attributes
X = dataset[:,0:3] #Does this mean columns 1-4?
y = dataset[:,4] #Is this the 5th column?
I think I am referencing my X values incorrectly.
Here is what I need:
X values reference columns 1-4 and my y value is the last column, which is the 5th. If I understand correctly, I should be referencing array indices 0:3 for the X values and number 4 for the y as I have done above. However, those values aren't correct. In other words, the values returned by the array don't match the values in the data - they are off by one column (index).
Not quite: slices in numpy (as in Python generally) exclude their upper bound, which is why your values are off by one column. dataset is a matrix in this case, so the numpy indexing operators ([]) use the conventional row, column format.
X = dataset[:,0:3] is interpreted as "all rows for columns 0 through 2", i.e. only three columns; to get the first four columns you want X = dataset[:,0:4]. y = dataset[:,4] is interpreted as "all rows for column 4", which is indeed the fifth column.
Using a multiline string as a standin for a csv file:
In [332]: txt=b"""0, 1, 2, 4, 5
.....: 6, 7, 8, 9, 10
.....: """
In [333]: data=np.loadtxt(txt.splitlines(), delimiter=',')
In [334]: data
Out[334]:
array([[  0.,   1.,   2.,   4.,   5.],
       [  6.,   7.,   8.,   9.,  10.]])
In [335]: data.shape
Out[335]: (2, 5)
In [336]: data[:,0:4]
Out[336]:
array([[ 0.,  1.,  2.,  4.],
       [ 6.,  7.,  8.,  9.]])
In [337]: data[:,4]
Out[337]: array([ 5., 10.])
numpy indexing starts at 0; [0:4] is the same (more or less) as the list of numbers starting at 0, up to, but not including 4.
In [339]: np.arange(0,4)
Out[339]: array([0, 1, 2, 3])
Another way to get all but the last column is to use -1 indexing
In [352]: data[:,:-1]
Out[352]:
array([[ 0.,  1.,  2.,  4.],
       [ 6.,  7.,  8.,  9.]])
Often a CSV file is a mix of numeric and string values. The documentation for the loadtxt dtype parameter has a short explanation of how you can load and access that as a structured array. genfromtxt is easier to use for that (though no less confusing).
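A small illustration of that structured-array route (a sketch with made-up data), using genfromtxt with dtype=None so each column gets its own inferred type:

In [353]: txt = """name,value,flag
   .....: alpha,1.5,0
   .....: beta,2.5,1"""
In [354]: data = np.genfromtxt(txt.splitlines(), delimiter=',', dtype=None, names=True, encoding='utf-8')
In [355]: data['value']
Out[355]: array([ 1.5,  2.5])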