How to operate on one dimension of an array? - python

I'm coming from IDL, so I'm most used to for loops with explicit indicing. I have read about how python does things differently and that you should just be able to say
for thing in things:
What I can't figure out is if a I have a 4 dimensional array and I want to perform an operation in one dimension of the array, how do I save out the result in a 4 dimensional array and do it in the 'python' way.
I have a 4 dimensional array in time, altitude, latitude, longitude. I want to smooth it using a running mean window of N=9.
Here is the code that I am working with:
KMCM_T = g.variables['temperature'][:,:,:,:] #K
N = 9
T_bar_run = []
for idx, lon in enumerate(KMCM_lon):
for idy, lat in enumerate(KMCM_lat):
for idz, lev in enumerate(KMCM_levels):
T_bar_run[:][idz][idy][idx] = np.convolve(KMCM_T[:,idz,idy,idx], np.ones((N,))/N, mode='same')

In this specific case you could probably use scipy.ndimage.convolve1d:
from scipy.ndimage import convolve1d
T_bar_run = convolve1d(KMCM_T, np.ones(N)/N, axis=0, mode='constant')
The "numpy way of doing things" is avoiding loops because in numerical applications often the overhead of an interpreted loop dwarfs the cost of its payload. This is done by relying on vectorized functions, i.e. functions that apply a certain operation to every cell of its array arguments.
Many such functions act naturally along one or a few dimensions which is why you will frequently encounter the axis keyword argument.

Related

Numpy: Efficient way to create a complex array from two real arrays

I have two real arrays (a and b), and I would like create a complex array (c) which takes the two real arrays as its real and imaginary parts respectively.
The simplest one would be
c = a + b * 1.0j
However, since my data size is quite large, such code is not very efficient.
We can also do the following,
c = np.empty(data_shape)
c.real = a
c.imag = b
I am wondering is there a better way to do that (e.g. using buffer or something)?
Thank you very much!
Since the real and imaginary parts of each element have to be contiguous, you will have to allocate another buffer to interleave the data no matter what. The second method shown in the question is therefore about as efficient as you're likely to get. One alternative would be
np.stack((a, b), axis=-1).view(np.complex).squeeze(-1)
This works for any array shape, not just 1D. It ensures proper interleaving by stacking along the last dimension in C order.
This assumes that your datatype is np.float. If not, either promote to float (e.g. a = a.astype(float)), or possibly change np.complex to something else.

Computation difference between function and manual computation

I am facing a mystery right now. I get strange results in some program and I think it may be related to the computation since I got different results with my functions compared to manual computation.
This is from my program, I am printing the values pre-computation :
print("\nPrecomputation:\nmatrix\n:", matrix)
tmp = likelihood_left * likelihood_right
print("\nconditional_dep:", tmp)
print("\nfinal result:", matrix # tmp)
I got the following output:
Precomputation:
matrix:
[array([0.08078721, 0.5802404 , 0.16957052, 0.09629893, 0.07310294])
array([0.14633129, 0.45458744, 0.20096238, 0.02142105, 0.17669784])
array([0.41198731, 0.06197812, 0.05934063, 0.23325626, 0.23343768])
array([0.15686545, 0.29516415, 0.20095091, 0.14720275, 0.19981674])
array([0.15965914, 0.18383683, 0.10606946, 0.14234812, 0.40808645])]
conditional_dep: [0.01391123 0.01388155 0.17221067 0.02675524 0.01033257]
final result: [0.07995043 0.03485223 0.02184015 0.04721548 0.05323298]
The thing is when I compute the following code:
matrix = [np.array([0.08078721, 0.5802404 , 0.16957052, 0.09629893, 0.07310294]),
np.array([0.14633129, 0.45458744, 0.20096238, 0.02142105, 0.17669784]),
np.array([0.41198731, 0.06197812, 0.05934063, 0.23325626, 0.23343768]),
np.array([0.15686545, 0.29516415, 0.20095091, 0.14720275, 0.19981674]),
np.array([0.15965914, 0.18383683, 0.10606946, 0.14234812, 0.40808645])]
tmp = np.asarray([0.01391123, 0.01388155, 0.17221067, 0.02675524, 0.01033257])
matrix # tmp
The values in use are exactly the same as they should be in the computation before but I get the following result:
array([0.04171218, 0.04535276, 0.02546353, 0.04688848, 0.03106443])
This result is then obviously different than the previous one and is the true one (I computed the dot product by hand).
I have been facing this problem the whole day and I did not find anything useful online. If any of you have any even tiny idea where it can come from I'd be really happy :D
Thank's in advance
Yann
PS: I can show more of the code if needed.
PS2: I don't know if it is relevant but this is used in a dynamic programming algorithm.
To recap our discussion in the comments, in the first part ("pre-computation"), the following is true about the matrix object:
>>> matrix.shape
(5,)
>>> matrix.dtype
dtype('O') # aka object
And as you say, this is due to matrix being a slice of a larger, non-uniform array. Let's recreate this situation:
>>> matrix = np.array([[], np.array([0.08078721, 0.5802404 , 0.16957052, 0.09629893, 0.07310294]), np.array([0.14633129, 0.45458744, 0.20096238, 0.02142105, 0.17669784]), np.array([0.41198731, 0.06197812, 0.05934063, 0.23325626, 0.23343768]), np.array([0.15686545, 0.29516415, 0.20095091, 0.14720275, 0.19981674]), np.array([0.15965914, 0.18383683, 0.10606946, 0.14234812, 0.40808645])])[1:]
It is now not a matrix with scalars in rows and columns, but a column vector of column vectors. Technically, matrix # tmp is an operation between two 1-D arrays and hence NumPy should, according to the documentation, calculate the inner product of the two. This is true in this case, with the convention that the sum be over the first axis:
>>> np.array([matrix[i] * tmp[i] for i in range(5)]).sum(axis=0)
array([0.07995043, 0.03485222, 0.02184015, 0.04721548, 0.05323298])
>>> matrix # tmp
array([0.07995043, 0.03485222, 0.02184015, 0.04721548, 0.05323298])
This is essentially the same as taking the transpose of the proper 2-D matrix before the multiplication:
>>> np.stack(matrix).T # tmp
array([0.07995043, 0.03485222, 0.02184015, 0.04721548, 0.05323298])
Equivalently, as noted by #jirasssimok:
>>> tmp # np.stack(matrix)
array([0.07995043, 0.03485222, 0.02184015, 0.04721548, 0.05323298])
Hence the erroneous or unexpected result.
As you have already resolved to do in the comments, this can be avoided in the future by ensuring all matrices are proper 2-D arrays.
It looks like you got the operands switched in one of your matrix multiplications.
Using the same values of matrix and tmp that you provided, matrix # tmp and tmp # matrix provide the two results you showed.1
matrix = [np.array([0.08078721, 0.5802404 , 0.16957052, 0.09629893, 0.07310294]),
np.array([0.14633129, 0.45458744, 0.20096238, 0.02142105, 0.17669784]),
np.array([0.41198731, 0.06197812, 0.05934063, 0.23325626, 0.23343768]),
np.array([0.15686545, 0.29516415, 0.20095091, 0.14720275, 0.19981674]),
np.array([0.15965914, 0.18383683, 0.10606946, 0.14234812, 0.40808645])]
tmp = np.asarray([0.01391123, 0.01388155, 0.17221067, 0.02675524, 0.01033257])
print(matrix # tmp) # [0.04171218 0.04535276 0.02546353 0.04688848 0.03106443]
print(tmp # matrix) # [0.07995043 0.03485222 0.02184015 0.04721548 0.05323298]
To make it a little more obvious what your code is doing, you might also consider using np.dot instead of #. If you pass matrix as the first argument and tmp as the second, it will have the result you want, and make it more clear that you're conceptually calculating dot products rather than multiplying matrices.
As an additional note, if you're performing matrix operations on matrix, it might be better if it was a single two-dimensional array instead of a list of 1-dimensional arrays. this will prevent errors of the sort you'll see right now if you try to run matrix # matrix. This would also let you say matrix.dot(tmp) instead of np.dot(matrix, tmp) if you wanted to.
(I'd guess that you can use np.stack or a similar function to create matrix, or you can call np.stack on matrix after creating it.)
1 Because tmp has only one dimension and matrix has two, NumPy can and will treat tmp as whichever type of vector makes the multiplication work (using broadcasting). So tmp is treated as a column vector in matrix # tmp and a row vector in tmp # matrix.

Gensim word2vec model outputs 1000 dimension ndarray but the maximum number of ndarray dimensions is 32 - how?

I'm trying to use this 1000 dimension wikipedia word2vec model to analyze some documents.
Using introspection I found out that the vector representation of a word is a 1000 dimension numpy.ndarray, however whenever I try to create an ndarray to find the nearest words I get a value error:
ValueError: maximum supported dimension for an ndarray is 32, found 1000
and from what I can tell by looking around online 32 is indeed the maximum supported number of dimensions for an ndarray - so what gives? How is gensim able to output a 1000 dimension ndarray?
Here is some example code:
doc = [model[word] for word in text if word in model.vocab]
out = []
n = len(doc[0])
print(n)
print(len(model["hello"]))
print(type(doc[0]))
for i in range(n):
sum = 0
for d in doc:
sum += d[i]
out.append(sum/n)
out = np.ndarray(out)
which outputs:
1000
1000
<class 'numpy.ndarray'>
ValueError: maximum supported dimension for an ndarray is 32, found 1000
The goal here would be to compute the average vector of all words in the corpus in a format that can be used to find nearby words in the model so any alternative suggestions to that effect are welcome.
You're calling numpy's ndarray() constructor-function with a list that has 1000 numbers in it – your hand-calculated averages of each of the 1000 dimensions.
The ndarray() function expects its argument to be the shape of the matrix constructed, so it's trying to create a new matrix of shape (d[0], d[1], ..., d[999]) – and then every individual value inside that matrix would be addressed with a 1000-int set of coordinates. And, indeed numpy arrays can only have 32 independent dimensions.
But even if you reduced the list you're supplying to ndarray() to just 32 numbers, you'd still have a problem, because your 32 numbers are floating-point values, and ndarray() is expecting integral counts. (You'd get a TypeError.)
Along the approach you're trying to take – which isn't quite optimal as we'll get to below – you really want to create a single vector of 1000 floating-point dimensions. That is, 1000 cell-like values – not d[0] * d[1] * ... * d[999] separate cell-like values.
So a crude fix along the lines of your initial approach could be replacing your last line with either:
result = np.ndarray(len(d))
for i in range(len(d)):
result[i] = d[i]
But there are many ways to incrementally make this more efficient, compact, and idiomatic – a number of which I'll mention below, even though the best approach, at bottom, makes most of these interim steps unnecessary.
For one, instead of that assignment-loop in my code just above, you could use Python's bracket-indexing assignment option:
result = np.ndarray(len(d))
result[:] = d # same result as previous 3-lines w/ loop
But in fact, numpy's array() function can essentially create the necessary numpy-native ndarray from a given list, so instead of using ndarray() at all, you could just use array():
result = np.array(d) # same result as previous 2-lines
But further, numpy's many functions for natively working with arrays (and array-like lists) already include things to do averages-of-many-vectors in a single step (where even the looping is hidden inside very-efficient compiled code or CPU bulk-vector operations). For example, there's a mean() function that can average lists of numbers, or multi-dimensional arrays of numbers, or aligned sets of vectors, and so forth.
This allows faster, clearer, one-liner approaches that can replace your entire original code with something like:
# get a list of available word-vetors
doc = [model[word] for word in text if word in model.vocab]
# average all those vectors
out = np.mean(doc, axis=0)
(Without the axis argument, it'd average together all individual dimension-values , in all slots, into just one single final average number.)

Python matrix multiplication 3d Array

I tried to solve a PDE numerically and in the course of this I faced the problem of a triple-nested for loop resembling the 3 spatial dimension. This construct is nested in another time loop, so you can imagine that the computing takes forever for sufficient large node numbers. The code block looks like this
for jy in range(0,cy-1):
for jx in range(0,cx-1):
for jz in range(0,cz-1):
T[n+1,jx,jy,jz] = T[n,jx,jy,jz] + s*(T[n,jx-1,jy,jz] - 2*T[n,jx,jy,jz] + T[n,jx+1,jy,jz]) + s*(T[n,jx,jy-1,jz] - 2*T[n,jx,jy,jz] + T[n,jx,jy+1,jz]) + s*(T[n,jx,jy,jz-1] - 2*T[n,jx,jy,jz] + T[n,jx,jy,jz+1])
It might look intimidating at first, but is quite easy. I have a 3 dimensional matrix representing a solid bulk material, where each point represents the current temperature. The iteratively calculated next temperature at each point is calculated taking into account each point next to that point - so 6 in total. In the case of a 1-dimensional solid the solution is just a simple matrix multiplication. Is there any chance to represent the 3-loop-system above in a simple matrix solution like in the 1D case?
Best regards!
With numpy you can easily do these kinds of matrix operations,
e.g for a 3x3 matrix
import numpy as np
T = np.random.random((3,3,3))
T = T*T - 2*T ... etc.
First off, you need to be a bit more careful with your terminology. A "matrix" is a 2-Dimensional array of numbers. So you are really talking about an array. Numpy, or better yet Scipy, has an data type called an ndarray. You need to be very careful manipulating them, because although they are sometimes used to represent matrices, there are operations that can be performed on 2-D arrays that are not mathematically legal for matrices.
I strongly recommend you use # and not * to perform multiplication of 1- or 2-D matrices, and be sure to add code to check that the operations you are doing are legal mathematically. As a trivial example, Python lets you add a 1 x n or an n x 1 vector to an n x n matrix, even though that is not mathematically correct. The reason it allows it is, as intimated above, because there is no true matrix type in Python.
It very well may be that you can reformulate your problem to use a 3-D array, and by experimentation find the particular operation you are trying to perform. Just keep in mind that the rules of linear algebra are only casually applied in Python.

Scipy LinearOperator With Multiple Inputs

I need to invert a large, dense matrix which I hoped to use Scipy's gmres to do. Fortunately, the dense matrix A follows a pattern and I do not need to store the matrix in memory. The LinearOperator class allows us to construct an object which acts as the matrix for GMRES and can compute directly the matrix vector product A*v. That is, we write a function mv(v) which takes as input a vector v and returns mv(v) = A*v. Then, we can use the LinearOperator class to create A_LinOp = LinearOperator(shape = shape, matvec = mv). We can put the linear operator into the Scipy gmres command to evaluate the matrix vector products without ever having to fully load A into memory.
The documentation for the LinearOperator is found here: LinearOperator Documentation.
Here is my problem: to write the routine to compute the matrix vector product mv(v) = A*v, I need another input vector C. The entries in A are of the form A[i,j] = f(C[i] - C[j]). So, what I really want is for mv to be of two inputs, one fixed vector input C, and one variable input v which we want to compute A*v.
MATLAB has a similar setup, where would write x = gmres(#(v) mv(v,C),b) where b is the right hand side of the problem Ax = b, , and mv is the function that takes as variable input v which we want to compute A*v and C is the fixed, known vector which we need for the assembly of A.
My problem is that I can't figure out how to allow the LinearOperator class to accept two inputs, one variable and one "fixed" like I can in MATLAB.
Is there a way to do the analogous operation in SciPy? Alternatively, if anyone knows of a better way of inverting a large, dense matrix (50000, 50000) where the entries follow a pattern, I would greatly appreciate any suggestions.
Thanks!
EDIT: I should have stated this information actually. The matrix is actually (in block form) [A C; C^T 0], where A is N x N (N large) and C is N x 3, and the 0 is 3 x 3 and C^T is the transpose of C. This array C is the same array as the one mentioned above. The entries of A follow a pattern A[i,j] = f(C[i] - C[j]).
I wrote mv(v,C) to go row by row construct A*v[i] for i=0,N, by computing sum f(C[i]-C[j)*v[j] (actually, I do numpy.dot(FC,v) where FC[j] = f(C[i]-C[j]) which works well). Then, at the end doing the computations for the C^T rows. I was hoping to eventually replace the large for loop with a multiprocessing call to parallelize the for loop, but that's a future thing to consider. I will also look into using Cython to speed up the computations.
This is very late, but if you're still interested...
Your A matrix must be very low rank since it's a nonlinearly transformed version of a rank-2 matrix. Plus it's symmetric. That means it's trivial to inverse: get the truncated eigenvalue decompostion with, say, 5 eigenvalues: A = U*S*U', then invert that: A^-1 = U*S^-1*U'. S is diagonal so this is inexpensive. You can get the truncated eigenvalue decomposition with eigh.
That takes care of A. Then for the rest: use the block matrix inversion formula. Looks nasty, but I will bet you 100,000,000 prussian francs that it's 50x faster than the direct method you were using.
I faced the same situation (some years later than you) of trying to use more than one argument to LinearOperator, but for another problem. The solution I found was the use of global variables, to avoid passing the variables as arguments to the function.

Categories