grouping data in arrays (python)

I'm trying to make a nice ordered way of grouping objects in an array. Now I've tried the following, but it gives me an error.
Any tips?
import numpy as np

# body: mass, [x, y], [vx, vy], [ax, ay]
bodies = np.array([[1E3, [0, 0], [0, 0], [0, 0]],
                   [1, [0, 200], [31.6, 0], [0, 0]]])
ValueError: setting an array element with a sequence.

You can use dtype=object, and then store anything you want—floats, tuples, lists, arrays. But really, that's not a good idea; you pretty much lose all the benefits of numpy.
And, because it's a bad idea, numpy doesn't make it easy for you. If you construct an array out of a list, it assumes any sub-lists are dimensions of the array, and if it can't make any sense of things that way, it gives you this error.
Why not just store the bodies as flat rows of numbers? You already have to interpret the rows as bodies at a higher level, and really, how is x, y = bodies[1][1] any better than x, y = bodies[1][1:3]?
If you really want to, you could create an array with one more dimension, but… why?
You also might want to consider using pandas instead of raw numpy, or using a database instead of using numpy in the first place, or just keeping each body as a Python object (whether sticking them in a numpy array or not), or something else entirely. Without knowing what you're trying to accomplish, it's hard to be sure what fits your needs. But it's pretty unlikely that what you're trying to do is the right thing to do.
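For illustration, here is a minimal sketch of the flat-row layout suggested above (the column order mass, x, y, vx, vy, ax, ay is just one possible convention):
import numpy as np

# One flat row of floats per body: mass, x, y, vx, vy, ax, ay
bodies = np.array([[1e3, 0.0,   0.0,  0.0, 0.0, 0.0, 0.0],
                   [1.0, 0.0, 200.0, 31.6, 0.0, 0.0, 0.0]])

mass = bodies[:, 0]    # all masses at once
pos  = bodies[:, 1:3]  # (n, 2) array of positions
vel  = bodies[:, 3:5]  # (n, 2) array of velocities
The whole table stays a plain float array, so vectorized numpy operations keep working on it.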

You could use a complex dtype:
bodies = np.array([(1E3, [0, 0], [0, 0], [0, 0]),
                   (1, [0, 200], [31.6, 0], [0, 0])],
                  dtype=[('mass', float), ('xy', '2float'),
                         ('vxy', '2float'), ('axy', '2float')])
and access the "columns" with
In [63]: bodies['mass']
Out[63]: array([ 1000., 1.])
In [64]: bodies['xy']
Out[64]:
array([[ 0., 0.],
[ 0., 200.]])
etc.,
but this will not make your life any easier.
I am making an n-body simulator
Calculating distances between objects will be a common operation in an n-body simulator. You might want to use scipy.spatial.distance's pdist or cdist for that. Notice that these functions expect their inputs to be NumPy ndarrays of simple, homogeneous dtype, so if you were to use an array of complex dtype, you'd always have to slice it first before you could use any of these functions.
Therefore, it probably would be simpler to just store arrays of simple, homogeneous dtype from the beginning and avoid the array of complex dtype.
I suggest making multiple 1-dimensional arrays:
mass = np.array(...)
x = np.array(...)
y = np.array(...)
or maybe use some 2D arrays of simple, homogeneous dtype:
pos = np.array([(x0, y0), (x1, y1), ...], dtype='float')
All your equations will be more readable this way too.
Instead of accessing the 2D position array with bodies['xy'] you would simply write pos. That's one less set of brackets your eyes will have to parse.
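As a sketch of why the homogeneous layout pays off, here is the distance computation on such a pos array (the positions themselves are made up for the example):
import numpy as np
from scipy.spatial.distance import pdist, cdist

pos = np.array([(0.0, 0.0), (0.0, 200.0), (150.0, 80.0)])  # hypothetical (n, 2) positions

d = pdist(pos)          # condensed vector of pairwise distances
dmat = cdist(pos, pos)  # full (n, n) distance matrix
No slicing of a structured array is needed; pos goes straight into the scipy functions.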

Related

How to vectorize a 2D scalar function over a mesh

I have a function foo(x, y) that takes two scalars (or lists of scalars) and returns a scalar output (or a list of scalars computed pairwise from the input). I want to be able to evaluate this function over two orthogonal arrays such that the output is a matrix whose (i, j) element is foo(x[i], y[j]).
I have a for-loop version that solves this problem as below:
import numpy as np

x = np.arange(50)  # could be linspaces, whatever the axis in the vector space is
y = np.arange(50)
mat = np.zeros((len(x), len(y)))  # to hold the result for plotting

for i in range(len(x)):
    for j in range(len(y)):
        mat[i][j] = foo(x[i], y[j])
where my result is stored in mat. However, this is dreadfully slow, and it looks as if it could easily be vectorized. I'm not aware of how Python solves this problem, however, as this doesn't appear to be something like zip or map. Is there another function or concept (beyond trivially building extremely long arrays of the same array rotated by a value and passing them in that way) that could vectorize this successfully? Or is the nature of the foo function limiting the ability to vectorize this?
In this case, itertools.product is the tool you want. It generates an iterable sequence of elements from the Cartesian product of N inputs, which you can use to discretely map a vector space. You can then evaluate foo on these. This isn't vectorization per se, but does reduce the nested for loop.
See docs at https://docs.python.org/3/library/itertools.html#itertools.product
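As a rough sketch of what that looks like (foo here is a stand-in for your own function), the nested loop can be collapsed like this:
import itertools
import numpy as np

def foo(a, b):  # stand-in for the real scalar function
    return a * b

x = np.arange(50)
y = np.arange(50)

# product(x, y) yields (x[i], y[j]) pairs with j varying fastest,
# so reshaping to (len(x), len(y)) restores mat[i][j] = foo(x[i], y[j])
mat = np.fromiter((foo(xi, yi) for xi, yi in itertools.product(x, y)),
                  dtype=float).reshape(len(x), len(y))
As the answer says, this is not true vectorization; foo is still called once per pair in Python.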

What is the fastest way to read in an image to an array of tuples?

I am trying to assign provinces to an area for use in a game mod. I have two separate maps for area and provinces.
provinces file,
area file.
Currently I am reading in an image in Python and storing it in an array using PIL like this:
import numpy as np
from PIL import Image

land_prov_pic = Image.open(INPUT_FILES_DIR + land_prov_str)
land_prov_array = np.array(land_prov_pic)
image_size = land_prov_pic.size

for x in range(image_size[0]):
    if x % 100 == 0:
        print(x)
    for y in range(image_size[1]):
        land_prov_array[x][y] = land_prov_pic.getpixel((x, y))
Where you end up with land_prov_array[x][y] = (R,G,B)
However, this gets really slow, especially for large images. I tried reading it in using OpenCV like this:
import cv2

land_prov_array = cv2.imread(INPUT_FILES_DIR + land_prov_str)
land_prov_array = cv2.cvtColor(land_prov_array, cv2.COLOR_BGR2RGB)  # convert from BGR to RGB
But now land_prov_array[x][y] = [R G B] which is an ndarray and can't be inserted into a set. But it's way faster than the previous for loop. How do I convert [R G B] to (R,G,B) for every element in the array without for loops or, better yet, read it in that way?
EDIT: Added pictures, more description, and code blocks for readability.
It is best to convert the [R,G,B] array to tuple when you need it to be a tuple, rather than converting the whole image to this form. An array of tuples takes up a lot more memory, and will be a lot slower to process, than a numeric array.
The answer by isCzech shows how to create a NumPy view over a 3D array that presents the data as if it were a 2D array of tuples. This might not require the additional memory of an actual array of tuples, but it is still a lot slower to process.
Most importantly, most NumPy functions (such as np.mean) and operators (such as +) cannot be applied to such an array. Thus, one is obliged to iterate over the array in Python code (or with an np.vectorize-wrapped function), which is a lot less efficient than using NumPy functions and operators that work on the array as a whole.
For transformation from a 3D array (data3D) to a 2D array (data2D), I've used this approach:
import numpy as np
dt = np.dtype([('x', 'u1'), ('y', 'u1'), ('z', 'u1')])
data2D = data3D.view(dtype=dt).squeeze()
The .view call reinterprets the data type and still returns a 3D array whose last dimension has size 1; that dimension can then be removed with .squeeze. Alternatively, you can use .squeeze(axis=-1) to squeeze only the last dimension (in case some of your other dimensions are of size 1 too).
Please note I've used uint8 ('u1') - your type may be different.
Trying to do this using a loop is very slow, indeed (compared to this approach at least).
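As a sketch of the tuple-on-demand approach from the first paragraph (the image here is a made-up stand-in):
import numpy as np

img = np.zeros((4, 4, 3), dtype=np.uint8)  # stand-in for the loaded RGB image

seen = set()
for row in img.reshape(-1, 3):  # iterate over pixels as (R, G, B) rows
    seen.add(tuple(row))        # convert to a tuple only at the point of use
If all you need is the set of distinct colours, np.unique(img.reshape(-1, 3), axis=0) computes it without any Python-level loop at all.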
Similar question here: Show a 2d numpy array where contents are tuples as an image

Assigning values to list slices of large dense square matrices (Python)

I'm dealing with large dense square matrices of size NxN ~(100k x 100k) that are too large to fit into memory.
After doing some research, I've found that most people handle large matrices by using either numpy's memmap or the pytables package. However, these packages seem to have major limitations: neither of them appears to support ASSIGNING values to list slices of the on-disk matrix along more than one dimension.
I would like to look for an efficient way to assign values to a large dense square matrix M with something like:
M[0, [1,2,3], [8,15,30]] = np.zeros((3, 3)) # or
M[0, [1,2,3,1,2,3,1,2,3], [8,8,8,15,15,15,30,30,30]] = 0 # for memmap
With memmap, the expression M[0, [1,2,3], [8,15,30]] would always copy the slice into RAM hence assignment doesn't seem to work.
With pytables, list slicing along more than one dimension is not supported. Currently I'm just slicing along one dimension followed by the other (i.e. M[0, [1,2,3]][:, [8,15,30]]). The RAM usage of this solution scales with N, which is better than dealing with the whole array (N^2) but is still not ideal.
In addition, it appears that pytables isn't the most efficient way of handling matrices with lots of rows (or could there be a way of specifying the chunksize to get rid of this message?). I am getting the following warning:
The Leaf ``/M`` is exceeding the maximum recommended rowsize (104857600 bytes);
be ready to see PyTables asking for *lots* of memory and possibly slow
I/O. You may want to reduce the rowsize by trimming the value of
dimensions that are orthogonal (and preferably close) to the *main*
dimension of this leave. Alternatively, in case you have specified a
very small/large chunksize, you may want to increase/decrease it.
I'm just wondering whether there are better solutions for assigning values to arbitrary 2d slices of large matrices.
First of all, note that in numpy (not sure about pytables) M[0, [1,2,3], [8,15,30]] will return an array of shape (3,) corresponding to elements M[0,1,8], M[0,2,15] and M[0,3,30], so assigning np.zeros((3,3)) to that will raise an error.
Now, the following works fine for me:
np.save('M.npy', np.random.randn(5,5,5)) # create some dummy matrix
M = np.load('M.npy', mmap_mode='r+') # load such matrix as a memmap
M[[0,1,2],[1,2,3],[2,3,4]] = 0
M.flush() # make sure the data is written to disk
del M
M = np.load('M.npy', mmap_mode='r+') # re-load matrix
print(M[[0,1,2],[1,2,3],[2,3,4]]) # should show array([0., 0., 0.])
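If the intent was actually to assign a full 3x3 block (every row/column combination), np.ix_ builds the open mesh needed for that; a sketch against the same small memmap (the indices are scaled down to fit the 5x5x5 dummy matrix):
M = np.load('M.npy', mmap_mode='r+')
M[0][np.ix_([1, 2, 3], [2, 3, 4])] = np.zeros((3, 3))  # block assignment via open mesh
M.flush()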

Numpy array and matrix multiplication

I am trying to get rid of the for loop and instead do an array-matrix multiplication to decrease the processing time when the weights array is very large:
import numpy as np
sequence = [np.random.random(10), np.random.random(10), np.random.random(10)]
weights = np.array([[0.1,0.3,0.6],[0.5,0.2,0.3],[0.1,0.8,0.1]])
Cov_matrix = np.matrix(np.cov(sequence))
results = []
for w in weights:
    result = np.matrix(w) * Cov_matrix * np.matrix(w).T
    results.append(result.A)
Where:
Cov_matrix is a 3x3 matrix
weights is an array of length n with n 1x3 matrices in it.
Is there a way to multiply/map weights to Cov_matrix and bypass the for loop? I am not very familiar with all the numpy functions.
I'd like to reiterate what's already been said in another answer: the np.matrix class has far more disadvantages than advantages these days, and I suggest moving to the use of the np.array class alone. Matrix multiplication of arrays can be written easily with the @ operator, so the notation is in most cases as elegant as for the matrix class (and arrays don't have several restrictions that matrices do).
With that out of the way, what you need can be done with a single call to np.einsum. We need to contract certain indices of three matrices while keeping one index alone in two of them. That is, we want to perform w_{ij} * Cov_{jk} * w.T_{ki} with a summation over j and k, giving us an array indexed by i. The following call to einsum will do:
res = np.einsum('ij,jk,ik->i', weights, Cov_matrix, weights)
Note that the above will give you a single 1d array, whereas you originally had a list of arrays with shape (1,1). I suspect the above result will even make more sense. Also, note that I omitted the transpose in the second weights argument, and this is why the corresponding summation indices appear as ik rather than ki. This should be marginally faster.
To prove that the above gives the same result:
In [8]: results # original
Out[8]: [array([[0.02803215]]), array([[0.02280609]]), array([[0.0318784]])]
In [9]: res # einsum
Out[9]: array([0.02803215, 0.02280609, 0.0318784 ])
The same can be achieved by working with the weights as a matrix and then looking at the diagonal elements of the result. Namely:
np.diag(weights.dot(Cov_matrix).dot(weights.transpose()))
which gives (the numbers differ from those above because sequence is random and differs between runs):
array([0.03553664, 0.02394509, 0.03765553])
This does more calculations than necessary (calculates off-diagonals) so maybe someone will suggest a more efficient method.
Note: I'd suggest slowly moving away from np.matrix and instead work with np.array. It takes a bit of getting used to not being able to do A*b but will pay dividends in the long run. Here is a related discussion.
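For reference, a sketch of the original loop rewritten with plain arrays and the @ operator (no np.matrix anywhere):
cov = np.cov(sequence)                    # plain ndarray, not np.matrix
results = [w @ cov @ w for w in weights]  # each w is 1d, so w @ cov @ w is a scalar
Each entry equals the quadratic form w * Cov * w.T for the corresponding row of weights, matching the einsum result above.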

Difference between numpy's np.transpose(matrix) and np.matrix.transpose() on a 2D matrix?

Is there a functional difference between numpy's np.transpose(matrix) and np.matrix.transpose() on a 2D matrix, given no axes specified in either?
Also, could someone try to intuitively explain how the axes specification works?
Thanks!
In [16]: np.matrix.transpose
Out[16]: <method 'transpose' of 'numpy.ndarray' objects>
This is the same as np.ndarray.transpose. In other words, it's the transpose method that the np.matrix subclass inherits from its parent class, np.ndarray (what we typically call a numpy array).
np.transpose is the function equivalent, which ends up calling the transpose method. It's different in that it can convert its input to an array first, e.g. if the input is a list. Lists, of course, don't have a transpose method.
As for what it does: with a 2d array it performs the usual mathematically defined transpose. What can confuse users coming from other languages is that numpy generalizes the concept to 1d, 3d, and higher dimensions. While you are at it, make sure you understand the difference between a 1d and a 2d array. If you aren't clear about that, transpose will remain a puzzle.
Transpose and its axes parameter have been explained in other SO questions. I'd suggest reading the docs alongside an interactive session in which you can play with arrays of various dimensions.
In [23]: x = np.arange(24).reshape(2,3,4)
In [24]: x.transpose().shape
Out[24]: (4, 3, 2)
In [25]: x.transpose(0,2,1).shape
Out[25]: (2, 4, 3)
Don't try to carry over MATLAB intuitions.
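A quick illustration of the 1d-vs-2d point (transpose is a no-op on a 1d array):
In [26]: v = np.arange(3)
In [27]: v.T.shape          # 1d array: transpose changes nothing
Out[27]: (3,)
In [28]: v2 = v.reshape(1, 3)
In [29]: v2.T.shape         # 2d row vector becomes a column vector
Out[29]: (3, 1)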
