I am trying to use python to compute the output of a function, say:
$f(x, y) = x + y$
Where x and y are the coordinates of the point in the array. So, the point 5, 5 would have the value 10. This will essentially generate an image of (x,y) and an associated pixel intensity value.
Right now I have a 100x100 dataframe in Python/Pandas, and want to know how to actually perform this calculation. My best guess is to iterate over each row and, using the index of the row (y) and the index of the element (x), pass these two values into the function and set the point to that value.
This is essentially a basic multivariable graphing problem. Was hoping someone had some experience doing stuff like this. Thank you!
There are numpy functions fromfunction and indices. They'll probably do what you want.
import numpy as np
np.fromfunction( lambda r, c: r+c, shape = (5,5))
# array([[0., 1., 2., 3., 4.],
# [1., 2., 3., 4., 5.],
# [2., 3., 4., 5., 6.],
# [3., 4., 5., 6., 7.],
# [4., 5., 6., 7., 8.]])
fromfunction takes a function as the first argument, then the shape. It passes the index arrays of each axis to the function, so the function needs as many arguments as there are dimensions in the shape.
np.indices((3,3))
# array([[[0, 0, 0], # Row coordinates
# [1, 1, 1],
# [2, 2, 2]],
#
# [[0, 1, 2], # Column coordinates
# [0, 1, 2],
# [0, 1, 2]]])
These can be used as function arguments to drive your results.
There are also np.ogrid and np.mgrid which generate np.arrays to use in any calculations. A lot depends on exactly what you want to do.
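As a rough sketch of how these grid helpers could be applied to the 100x100 case from the question (the variable names are illustrative, not taken from the question), both np.indices and np.ogrid give coordinate arrays that can simply be added:

```python
import numpy as np

# Full coordinate arrays: rows[i, j] == i and cols[i, j] == j.
rows, cols = np.indices((100, 100))
values = rows + cols            # f(x, y) = x + y evaluated at every point

# Open (broadcastable) grids: r has shape (100, 1), c has shape (1, 100).
r, c = np.ogrid[0:100, 0:100]
values_ogrid = r + c            # broadcasting expands this to (100, 100)

print(values[5, 5])             # the point (5, 5) has the value 10
```

If the result is needed in pandas, the array can be wrapped afterwards with pd.DataFrame(values).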
Edit: np.fromfunction with keyword arguments.
def test(a, b, c, m0=1, m1=1):  # Specify function with kwargs.
    return a * m0 + b * m1 + c
np.fromfunction(test, (4, 3, 5), m0=100, m1=10)  # Change the kwargs at run time.
# array([[[ 0., 1., 2., 3., 4.],
# [ 10., 11., 12., 13., 14.],
# [ 20., 21., 22., 23., 24.]],
# [[100., 101., 102., 103., 104.],
# [110., 111., 112., 113., 114.],
# [120., 121., 122., 123., 124.]],
# [[200., 201., 202., 203., 204.],
# [210., 211., 212., 213., 214.],
# [220., 221., 222., 223., 224.]],
# [[300., 301., 302., 303., 304.],
# [310., 311., 312., 313., 314.],
# [320., 321., 322., 323., 324.]]])
I have two different array processing problems that I'd like to solve AQAP (Q=quickly) to ensure that the solutions aren't rate-limiting in my process (using NEAT to train a video game bot). In one case, I want to build a penalty function for making larger column heights, and in the other I want to reward building "islands" of a common value.
Operations begin on a 26 row x 6 column numpy array of grayscale values with a black/0 background.
I have working solutions for each problem that already implement some numpy, but I'd like to push for a fully vectorized approach to both.
import numpy as np
from scipy.ndimage import label as sp_label
from math import ceil
Both problems start from an array like this:
img= np.array([[ 0., 0., 0., 12., 0., 0.],
[ 0., 0., 0., 14., 0., 0.],
[ 0., 0., 0., 14., 0., 0.],
[ 0., 0., 0., 14., 0., 0.],
[16., 0., 0., 14., 0., 0.],
[16., 0., 0., 12., 0., 0.],
[12., 0., 11., 0., 0., 0.],
[12., 0., 11., 0., 0., 0.],
[16., 0., 15., 0., 15., 0.],
[16., 0., 15., 0., 15., 0.],
[14., 0., 12., 0., 11., 0.],
[14., 0., 12., 0., 11., 0.],
[14., 15., 11., 0., 11., 0.],
[14., 15., 11., 0., 11., 0.],
[13., 16., 12., 0., 13., 0.],
[13., 16., 12., 0., 13., 0.],
[13., 14., 16., 0., 16., 0.],
[13., 14., 16., 0., 16., 0.],
[16., 14., 15., 0., 14., 0.],
[16., 14., 15., 0., 14., 0.],
[14., 16., 14., 0., 11., 0.],
[14., 16., 14., 0., 11., 0.],
[11., 13., 14., 16., 12., 13.],
[11., 13., 14., 16., 12., 13.],
[12., 12., 15., 14., 15., 11.],
[12., 12., 15., 14., 15., 11.]])
The first (column height) problem is currently being solved with:
# define valid connection directions for sp_label
c_valid_conns = np.array((0,1,0,0,1,0,0,1,0,), dtype=int).reshape((3,3))
# run the island labeling function sp_label
# c_ncomponents is a simple count of the connected columns in labeled
columns, c_ncomponents = sp_label(img, c_valid_conns)
# calculate out the column lengths
col_lengths = np.array([(columns[columns == n]/n).sum() for n in range(1, c_ncomponents+1)])
col_lengths
to give me this array: [ 6. 22. 20. 18. 14. 4. 4.]
(bonus if the code consistently ignores the labeled region that does not "contain" the bottom of the array (row index 25/-1))
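One possible way to drop the list comprehension, sketched here on a small made-up board rather than the 26x6 array above: since (columns[columns == n]/n).sum() is just the number of cells carrying label n, np.bincount over the label array gives all column lengths in one call. The names board, grounded and grounded_sizes are assumptions for illustration only.

```python
import numpy as np
from scipy.ndimage import label

# Small hypothetical board; nonzero cells are blocks, 0 is background.
board = np.array([[0, 3, 0],
                  [5, 3, 0],
                  [5, 0, 0],
                  [0, 0, 7],
                  [2, 0, 7]], dtype=float)

# Vertical-only connectivity, as in c_valid_conns.
vertical = np.array([[0, 1, 0],
                     [0, 1, 0],
                     [0, 1, 0]])
labels, n_components = label(board, vertical)

# Count the cells per label in one pass; index 0 is the background.
sizes = np.bincount(labels.ravel(), minlength=n_components + 1)[1:]

# Keep only the columns whose label appears in the bottom row.
grounded = labels[-1][labels[-1] > 0]
grounded_sizes = sizes[grounded - 1]
```

The order of the entries in sizes follows the label numbers assigned by scipy, so it should be read together with labels rather than assumed to run left to right.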
The second problem involves masking for each unique value and calculating the contiguous bodies in each masked array to get me the size of the contiguous bodies:
# initial values to start the ball rolling
values = [11, 12, 13, 14, 15, 16]
isle_avgs_i = [1.25, 2, 0, 1,5, 2.25, 1]
# apply filter masks to img to isolate each value
# Could these masks be pushed out into a third array dimension instead?
masks = [(img == g) for g in np.unique(values)]
# define the valid connectivities (8-way) for the sp_label function
m_valid_conns = np.ones((3,3), dtype=int)
# initialize islanding lists
# (I'd love to do away with these when I no longer need the .append() method)
mask_isle_avgs, isle_avgs = [],[]
# for each mask in the image:
for i, mask in enumerate(masks):
    # run the island labeling function sp_label
    # m_labeled is the array containing the sequentially labeled islands
    # m_ncomponents is a simple count of the islands in m_labeled
    m_labeled, m_ncomponents = sp_label(mask, m_valid_conns)
    # collect the average (island size-1)s (halving to account for...
    # ... y resolution) for each island into mask_isle_avgs list
    # I'd like to vectorize this step
    mask_isle_avgs.append((sum([ceil((m_labeled[m_labeled == n]/n).sum()/2)-1
                                for n in range(1, m_ncomponents+1)]))/(m_ncomponents+1))
    # add up the mask isle averages for all the islands...
    # ... and collect into isle_avgs list
    # I'd like to vectorize this step
    isle_avgs.append(sum(mask_isle_avgs))
# initialize a difference list for the isle averages (I also want to do away with this step)
d_avgs = []
# evaluate whether isle_avgs is greater for the current frame or the...
# ... previous frame (isle_avgs_i) and append either the current...
# ... element or 0, depending on whether the delta is non-negative
# I want this command vectorized
[d_avgs.append(isle_avgs[j])
if (isle_avgs[j]-isle_avgs_i[j])>=0
else d_avgs.append(0) for j in range(len(isle_avgs))]
d_avgs
to give me this d_avgs array: [0, 0, 0.46785714285714286, 1.8678571428571429, 0, 0]
(bonus again if the code consistently ignores the labeled region that does not "contain" the bottom of the array (row index 25/-1) to instead give this array:
[0, 0, 0.43452380952380953, 1.6345238095238095, 0, 0] )
I'm looking to remove any list operations and comprehensions and move them into fully vectorized numpy/scipy implementation with the same results.
Any help removing any of these steps would be greatly appreciated.
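For the last two steps (the running total and the delta test), a minimal sketch with made-up numbers could look like the following. It assumes the per-mask averages are already available as an array; np.cumsum stands in for the repeated sum(mask_isle_avgs) and np.where for the append-or-zero comprehension.

```python
import numpy as np

# Hypothetical per-mask island averages, standing in for mask_isle_avgs.
mask_isle_avgs = np.array([0.5, 0.25, 1.0])
# Hypothetical previous-frame values, standing in for isle_avgs_i.
isle_avgs_i = np.array([1.0, 0.5, 2.0])

# Running total replaces isle_avgs.append(sum(mask_isle_avgs)) in the loop.
isle_avgs = np.cumsum(mask_isle_avgs)

# Keep the current value where the delta is non-negative, otherwise 0.
d_avgs = np.where(isle_avgs - isle_avgs_i >= 0, isle_avgs, 0.0)
print(d_avgs)
```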
Here's how I ultimately solved this issue:
######## column height penalty calculation ########
# c_ncomponents is a simple count of the connected columns in labeled
columns, c_ncomponents = sp_label(unit_img, c_valid_conns)
# print(columns)
# throw out the falling block with .isin(x,x[-1]) combined with...
# the mask nonzero(x)
drop_falling = np.isin(columns, columns[-1][np.nonzero(columns[-1])])
col_hts = drop_falling.sum(axis=0)
# print(f'col_hts {col_hts}')
# calculate differentials for the (grounded) column heights
d_col_hts = np.sum(col_hts - col_hts_i)
# print(f'col_hts {col_hts} - col_hts_i {col_hts_i} ===> d_col_hts {d_col_hts}')
# set col_hts_i to current col_hts for next evaluation
col_hts_i = col_hts
# calculate penalty/bonus function
# col_pen = (col_hts**4 - 3**4).sum()
col_pen = np.where(d_col_hts > 0, (col_hts**4 - 3**4), 0).sum()
#
# if col_pen !=0:
# print(f'col_pen: {col_pen}')
######## end column height penalty calculation ########
######## color island bonus calculation ########
# mask the unit_img to remove the falling block
isle_img = drop_falling * unit_img
# print(isle_img)
# broadcast the game board to add a layer for each color
isle_imgs = np.broadcast_to(isle_img,(7,*isle_img.shape))
# define a mask to discriminate on color in each layer
isle_masked = isle_imgs*[isle_imgs==ind_grid[0]]
# reshape the array to return to 3 dimensions
isle_masked = isle_masked.reshape(isle_imgs.shape)
# generate the isle labels
isle_labels, isle_ncomps = sp_label(isle_masked, i_valid_conns)
# determine the island sizes (via return_counts) for all the unique labels
isle_inds, isle_sizes = np.unique(isle_labels, return_counts=True)
# zero out isle_sizes[0] to remove spike for background (500+ for near empty board)
isle_sizes[0] = 0
# evaluate difference to determine whether bonus applies
if isle_sizes_i.sum() != isle_sizes.sum():
    # calculate bonus for all island sizes after throwing away the 0 count
    isle_bonus = (isle_sizes**3).sum()
else:
    isle_bonus = 0
As the question says, what does -1 do in pytorch view?
>>> a = torch.arange(1, 17)
>>> a
tensor([ 1., 2., 3., 4., 5., 6., 7., 8., 9., 10.,
11., 12., 13., 14., 15., 16.])
>>> a.view(1,-1)
tensor([[ 1., 2., 3., 4., 5., 6., 7., 8., 9., 10.,
11., 12., 13., 14., 15., 16.]])
>>> a.view(-1,1)
tensor([[ 1.],
[ 2.],
[ 3.],
[ 4.],
[ 5.],
[ 6.],
[ 7.],
[ 8.],
[ 9.],
[ 10.],
[ 11.],
[ 12.],
[ 13.],
[ 14.],
[ 15.],
[ 16.]])
Does it (-1) generate additional dimension?
Does it behave the same as numpy reshape -1?
Yes, it does behave like -1 in numpy.reshape(), i.e. the actual value for this dimension will be inferred so that the number of elements in the view matches the original number of elements.
For instance:
import torch
x = torch.arange(6)
print(x.view(3, -1)) # inferred size will be 2 as 6 / 3 = 2
# tensor([[ 0., 1.],
# [ 2., 3.],
# [ 4., 5.]])
print(x.view(-1, 6)) # inferred size will be 1 as 6 / 6 = 1
# tensor([[ 0., 1., 2., 3., 4., 5.]])
print(x.view(1, -1, 2)) # inferred size will be 3 as 6 / (1 * 2) = 3
# tensor([[[ 0., 1.],
# [ 2., 3.],
# [ 4., 5.]]])
# print(x.view(-1, 5)) # throw error as there's no int N so that 5 * N = 6
# RuntimeError: invalid argument 2: size '[-1 x 5]' is invalid for input with 6 elements
print(x.view(-1, -1, 3)) # throw error as only one dimension can be inferred
# RuntimeError: invalid argument 1: only one dimension can be inferred
I love the answer that Benjamin gives https://stackoverflow.com/a/50793899/1601580
Yes, it does behave like -1 in numpy.reshape(), i.e. the actual value for this dimension will be inferred so that the number of elements in the view matches the original number of elements.
but I think the weird edge case that might not be intuitive for you (or at least it wasn't for me) is calling it with a single -1, i.e. tensor.view(-1).
My guess is that it works exactly the same way as always, except that since you are giving a single number to view, it assumes you want a single dimension. If you had tensor.view(-1, Dnew), it would produce a tensor with two dimensions/indices and infer the first one so that the total number of elements matches the original tensor. Say the original shape is (D1, D2) and Dnew = D1*D2; then the inferred first dimension would be 1.
For real examples with code you can run:
import torch
x = torch.randn(1, 5)
x = x.view(-1)
print(x.size())
x = torch.randn(2, 4)
x = x.view(-1, 8)
print(x.size())
x = torch.randn(2, 4)
x = x.view(-1)
print(x.size())
x = torch.randn(2, 4, 3)
x = x.view(-1)
print(x.size())
output:
torch.Size([5])
torch.Size([1, 8])
torch.Size([8])
torch.Size([24])
History/Context
I feel a good example is this code, which was common early on in PyTorch before the flatten layer was officially added:
from torch import nn
class Flatten(nn.Module):
    def forward(self, input):
        # input.size(0) usually denotes the batch size so we want to keep that
        return input.view(input.size(0), -1)
This was meant for use in nn.Sequential. Seen this way, x.view(-1) is a weird flatten layer that is missing the extra dimension of size 1 (i.e. the unsqueeze). Adding or removing this dimension is usually important for the code to actually run.
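To make the comparison concrete, here is a small sketch (assuming a reasonably recent PyTorch where torch.flatten and nn.Flatten are available) that puts the hand-written view call next to the built-in ops. The tensor shape is arbitrary and chosen only for illustration.

```python
import torch
from torch import nn

x = torch.randn(4, 3, 2)                  # pretend batch of 4 samples

# Hand-written flatten: keep the batch dimension, infer the rest.
by_view = x.view(x.size(0), -1)           # shape (4, 6)

# Built-in equivalents.
by_func = torch.flatten(x, start_dim=1)   # shape (4, 6)
by_layer = nn.Flatten()(x)                # start_dim defaults to 1

# A bare view(-1) also collapses the batch dimension.
fully_flat = x.view(-1)                   # shape (24,)
```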
Example2
If you are wondering what x.view(-1) does: it flattens the tensor. Why? Because it has to construct a new view with only 1 dimension and infer the size of that dimension, so it flattens it. In addition, it seems this operation avoids the very nasty bugs .resize() brings, since the order of the elements seems to be respected. FYI, PyTorch now has this op for flattening: https://pytorch.org/docs/stable/generated/torch.flatten.html
#%%
"""
Summary: view(-1, ...) keeps the remaining dimensions as given and infers the -1 location such that it respects the
original view of the tensor. If it's only .view(-1) then it only has 1 dimension given all the previous ones so it ends
up flattening the tensor.
ref: my answer https://stackoverflow.com/a/66500823/1601580
"""
import torch
x = torch.arange(6)
print(x)
x = x.reshape(3, 2)
print(x)
print(x.view(-1))
output
tensor([0, 1, 2, 3, 4, 5])
tensor([[0, 1],
[2, 3],
[4, 5]])
tensor([0, 1, 2, 3, 4, 5])
See: the elements come back in their original order, i.e. the original flat tensor is recovered!
I guess this works similar to np.reshape:
The new shape should be compatible with the original shape. If an integer, then the result will be a 1-D array of that length. One shape dimension can be -1. In this case, the value is inferred from the length of the array and remaining dimensions.
If you have a = torch.arange(1, 19) (18 elements) you can view it in various ways, like a.view(-1,6), a.view(-1,9), a.view(3,-1) etc.
From the PyTorch documentation:
>>> x = torch.randn(4, 4)
>>> x.size()
torch.Size([4, 4])
>>> y = x.view(16)
>>> y.size()
torch.Size([16])
>>> z = x.view(-1, 8) # the size -1 is inferred from other dimensions
>>> z.size()
torch.Size([2, 8])
-1 is inferred as 2 here. For instance, if you have
>>> a = torch.rand(4,4)
>>> a.size()
torch.Size([4, 4])
>>> y = a.view(16)
>>> y.size()
torch.Size([16])
>>> z = a.view(-1,8) # -1 is inferred as 2, i.e. (2,8)
>>> z.size()
torch.Size([2, 8])
-1 is a PyTorch alias for "infer this dimension given the others have all been specified" (i.e. the quotient of the original product by the new product). It is a convention taken from numpy.reshape().
Hence, for the 16-element tensor in the example, t.view(1,16) would be equivalent to t.view(1,-1) or t.view(-1,16).
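The "quotient" rule can be written out in a few lines of plain Python. This is only an illustration of the arithmetic, not PyTorch's actual implementation, and infer_shape is a hypothetical helper name.

```python
from math import prod

def infer_shape(numel, shape):
    """Resolve at most one -1 in `shape` for a tensor with `numel` elements."""
    if shape.count(-1) > 1:
        raise ValueError("only one dimension can be inferred")
    # Product of the explicitly given sizes (1 if only -1 was passed).
    known = prod(d for d in shape if d != -1)
    if -1 not in shape:
        if known != numel:
            raise ValueError(f"shape {shape} is invalid for {numel} elements")
        return tuple(shape)
    if known == 0 or numel % known != 0:
        raise ValueError(f"shape {shape} is invalid for {numel} elements")
    # The inferred size is the quotient of the two products.
    return tuple(numel // known if d == -1 else d for d in shape)

print(infer_shape(16, (1, -1)))   # (1, 16)
print(infer_shape(6, (3, -1)))    # (3, 2)
print(infer_shape(24, (-1,)))     # (24,)
```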
Data is in a structured array:
import numpy as np
dtype = [(field, float) for field in ['x', 'y', 'z', 'prop1', 'prop2']]
data = np.array([(1,2,3,4,5), (6,7,8,9,10), (11,12,13,14,15)], dtype=dtype)
For some operations, the positions are accessed as a single nx3 array, for example:
positions = data[['x', 'y', 'z']].view(dtype=float).reshape(-1, 3)
ranges = np.sqrt(np.sum(positions**2, 1))
Since numpy 1.12, the following warning is emitted:
FutureWarning: Numpy has detected that you may be viewing or writing to an array returned by selecting multiple fields in a
structured array.
This code may break in numpy 1.13 because this will return a view instead of a copy -- see release notes for details.
Here is the corresponding entry in the release notes:
Indexing a structured array with multiple fields (eg, arr[['f1', 'f3']]) will return a view into the original array in 1.13, instead of a copy. Note the returned view will have extra padding bytes corresponding to intervening fields in the original array, unlike the copy in 1.12, which will affect code such as arr[['f1', 'f3']].view(newdtype).
How to port this code to numpy >=1.13?
Checking on numpy 1.13 the announced change doesn't appear to have happened yet. So let's simulate the future:
The future behavior will presumably be not to copy the data but to create a dtype that has only the fields you want, but the itemsize of the original dtype. So there will be gaps in each element, parts of memory that are not used.
xyz_tp = np.dtype({'names': list('xyz'),
                   'formats': tuple(data.dtype.fields[f][0] for f in 'xyz'),
                   'offsets': tuple(data.dtype.fields[f][1] for f in 'xyz'),
                   'itemsize': data.dtype.itemsize})
xyz = data.view(xyz_tp)
xyz
# array([( 1., 2., 3.), ( 6., 7., 8.), ( 11., 12., 13.)],
# dtype={'names':['x','y','z'], 'formats':['<f8','<f8','<f8'], 'offsets':[0,8,16], 'itemsize':40})
The unused memory locations and their contents are ignored but still there, so if you view the array with a builtin dtype they'll reappear.
xyz.view(float)
# array([ 1., 2., 3., 4., 5., 6., 7., 8., 9., 10., 11.,
# 12., 13., 14., 15.])
# Ouch!
The general fix would be to cast to a contiguous (no gaps) dtype with the same fields. This will force a copy:
xyz_cont_tp = np.dtype({'names': list('xyz'), 'formats': 3*('<f8',)})
xyz.astype(xyz_cont_tp).view(float).reshape(-1, 3)
# array([[ 1., 2., 3.],
# [ 6., 7., 8.],
# [ 11., 12., 13.]])
In the special case of your selected fields being contiguous and of same type you can also do:
np.lib.stride_tricks.as_strided(data.view(float), shape=(3,3), strides=data.strides + (8,))
# array([[ 1., 2., 3.],
# [ 6., 7., 8.],
# [ 11., 12., 13.]])
This method does not copy data but creates a genuine view.
Another way, for several adjacent float fields: here, for 3 fields starting from 'x', we obtain the same result with:
np.ndarray((len(data),3), float, data, offset= data.dtype.fields['x'][1], strides= (data.strides[0], np.dtype(float).itemsize))
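A further option that is not mentioned in the answers above, and which assumes NumPy 1.16 or later, is the helper numpy.lib.recfunctions.structured_to_unstructured. A minimal sketch using the data from the question:

```python
import numpy as np
from numpy.lib import recfunctions as rfn

dtype = [(field, float) for field in ['x', 'y', 'z', 'prop1', 'prop2']]
data = np.array([(1, 2, 3, 4, 5), (6, 7, 8, 9, 10), (11, 12, 13, 14, 15)],
                dtype=dtype)

# Convert the selected fields into a plain (n, 3) float array,
# regardless of any padding in the multi-field view.
positions = rfn.structured_to_unstructured(data[['x', 'y', 'z']])
ranges = np.sqrt(np.sum(positions**2, axis=1))
```

Whether the result shares memory with data depends on the field layout, so it is safer not to rely on it being a view.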
The following code works fine when the tensor is a dmatrix as follows:
import theano
import theano.tensor as T
from theano import function, sparse
A = T.dmatrix('A') # Input tensor
X, updates = theano.scan(lambda i: T.sum((A+A[i])*(T.neq(A*A[i],0)), axis=1),
sequences=T.arange(A.shape[0]))
compute = function([A], X)
Sample input:
a = [[1,2,3,0,9],[3,2,6,2,7],[0,0,0,8,0],[1,0,0,0,3]]
compute(a)
Corresponding output:
array([[ 30., 33., 0., 14.],
[ 33., 40., 10., 14.],
[ 0., 10., 16., 0.],
[ 14., 14., 0., 8.]])
The real pain comes into play when I try converting this to a sparse matrix.
A = sparse.csr_matrix(name='A', dtype='int64')
The following error pops up when X is compiled:
...
...
NotImplementedError: Theano has no sparse vectorUse X[a:b, c:d], X[a:b, c:c+1] or X[a:b] instead.
I also tried substituting the addition and multiply operations in the scan function with sparse.basic.add and sparse.basic.mul respectively. No matter what I do, the above error persists.
Please help. What should I do to fix this?
In pure Python you can grow matrices column by column pretty easily:
data = []
for i in something:
    newColumn = getColumnDataAsList(i)
    data.append(newColumn)
NumPy's array doesn't have the append function. The hstack function doesn't work on zero sized arrays, thus the following won't work:
data = numpy.array([])
for i in something:
    newColumn = getColumnDataAsNumpyArray(i)
    data = numpy.hstack((data, newColumn)) # ValueError: arrays must have same number of dimensions
So, my options are either to move the initialization inside the loop, guarded by an appropriate condition:
data = None
for i in something:
    newColumn = getColumnDataAsNumpyArray(i)
    if data is None:
        data = newColumn
    else:
        data = numpy.hstack((data, newColumn)) # works
... or to use a Python list and convert it to an array later:
data = []
for i in something:
    newColumn = getColumnDataAsNumpyArray(i)
    data.append(newColumn)
data = numpy.array(data)
Both variants seem a little bit awkward to me. Are there nicer solutions?
NumPy actually does have an append function, which it seems might do what you want, e.g.,
import numpy as NP
my_data = NP.random.randint(0, 10, 9).reshape(3, 3)  # randint's upper bound is exclusive
new_col = NP.array((5, 5, 5)).reshape(3, 1)
res = NP.append(my_data, new_col, axis=1)
Your second snippet (hstack) will work if you add another line, e.g.,
my_data = NP.random.randint(0, 10, 16).reshape(4, 4)  # randint's upper bound is exclusive
# the line to add--does not depend on array dimensions
new_col = NP.zeros_like(my_data[:,-1]).reshape(-1, 1)
res = NP.hstack((my_data, new_col))
hstack gives the same result as concatenate((my_data, new_col), axis=1), i'm not sure how they compare performance-wise.
While that's the most direct answer to your question, i should mention that looping through a data source to populate a target via append, while just fine in python, is not idiomatic NumPy. Here's why:
initializing a NumPy array is relatively expensive, and with this conventional python pattern, you incur that cost, more or less, at each loop iteration (i.e., each append to a NumPy array is roughly like initializing a new array with a different size).
For that reason, the common pattern in NumPy for iterative addition of columns to a 2D array is to initialize an empty target array once (or pre-allocate a single 2D NumPy array having all of the empty columns), then successively populate those empty columns by setting the desired column-wise offset (index). It is much easier to show than to explain:
>>> # initialize your skeleton array using 'empty' for lowest-memory footprint
>>> M = NP.empty(shape=(10, 5), dtype=float)
>>> # create a small function to mimic step-wise populating this empty 2D array:
>>> fnx = lambda v : NP.random.randint(0, 10, v)
>>> # populate the NumPy array as in the OP, except each iteration just re-sets the values of M at successive column-wise offsets
>>> for index, itm in enumerate(range(5)):
...     M[:,index] = fnx(10)
>>> M
array([[ 1., 7., 0., 8., 7.],
[ 9., 0., 6., 9., 4.],
[ 2., 3., 6., 3., 4.],
[ 3., 4., 1., 0., 5.],
[ 2., 3., 5., 3., 0.],
[ 4., 6., 5., 6., 2.],
[ 0., 6., 1., 6., 8.],
[ 3., 8., 0., 8., 0.],
[ 5., 2., 5., 0., 1.],
[ 0., 6., 5., 9., 1.]])
>>> # of course, if you don't know in advance what size your array should be,
>>> # just create one much bigger than you need and trim the 'unused' portions
>>> # when you finish populating it
>>> M[:3,:3]
array([[ 1., 7., 0.],
[ 9., 0., 6.],
[ 2., 3., 6.]])
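The "allocate too much, then trim" idea could be sketched as follows. The capacity of 8 columns and the counter n_filled are assumptions chosen for illustration; in real code the capacity would come from an upper bound on the number of columns.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, capacity = 10, 8                  # capacity is a generous upper bound

M = np.empty((n_rows, capacity))          # allocated once, contents undefined
n_filled = 0
for _ in range(5):                        # only 5 columns actually arrive
    M[:, n_filled] = rng.integers(0, 10, n_rows)
    n_filled += 1

M = M[:, :n_filled]                       # trim the unused columns (a view)
```

Since the trimmed result is a view, M.copy() would be needed if the oversized buffer should be released.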
Usually you don't keep resizing a NumPy array when you create it. What don't you like about your third solution? If it's a very large matrix/array, then it might be worth allocating the array before you start assigning its values:
x = len(something)
y = getColumnDataAsNumpyArray.someLengthProperty
data = numpy.zeros( (x,y) )
for i in something:
    data[i] = getColumnDataAsNumpyArray(i)
The hstack can work on zero sized arrays:
import numpy as np
N = 5
M = 15
a = np.ndarray(shape = (N, 0))
for i in range(M):
    b = np.random.rand(N, 1)
    a = np.hstack((a, b))
Generally it is expensive to keep reallocating the NumPy array - so your third solution is really the best performance wise.
However I think hstack will do what you want - the clue is in the error message,
ValueError: arrays must have same number of dimensions
I'm guessing that newColumn has two dimensions (rather than being a 1D vector), so you need data to have two dimensions as well, for example data = np.array([[]]) (which has shape (1, 0); for columns of length N, np.empty((N, 0)) has the matching row count). Alternatively, make newColumn a 1D vector (generally, if things are 1D it is better to keep them 1D in NumPy, so that broadcasting etc. works better); in that case use np.squeeze(newColumn), and hstack or vstack should work with your original definition of data.
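Putting the two suggestions side by side, here is a small sketch. get_column is a hypothetical stand-in for getColumnDataAsNumpyArray and returns a 1D array; the first variant collects the columns in a list and stacks them once with np.column_stack, the second grows a 2D array that starts with zero columns.

```python
import numpy as np

def get_column(i, n_rows=4):
    """Hypothetical stand-in for getColumnDataAsNumpyArray."""
    return np.full(n_rows, float(i))

# Variant 1: gather 1D columns in a list, build the array once at the end.
columns = [get_column(i) for i in range(3)]
data = np.column_stack(columns)                  # shape (4, 3)

# Variant 2: start from a 2D array with zero columns and hstack onto it.
grown = np.empty((4, 0))
for i in range(3):
    new_column = get_column(i).reshape(-1, 1)    # make it 2D: shape (4, 1)
    grown = np.hstack((grown, new_column))
```

Both give the same array, but the first variant allocates the final array only once, which is why it tends to be preferred for many columns.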