Translating a Linear Regression from Matlab to Python

I tried to translate a piece of code from Matlab to Python and I'm running into some errors:
Matlab:
function [beta] = linear_regression_train(traindata)
y = traindata(:,1); %output
ind2 = find(y == 2);
ind3 = find(y == 3);
y(ind2) = -1;
y(ind3) = 1;
X = traindata(:,2:257); %X matrix,with size of 1389x256
beta = inv(X'*X)*X'*y;
Python:
def linear_regression_train(traindata):
    y = traindata[:,0] # This is the output
    ind2 = (labels==2).nonzero()
    ind3 = (labels==3).nonzero()
    y[ind2] = -1
    y[ind3] = 1
    X = traindata[ : , 1:256]
    X_T = numpy.transpose(X)
    beta = inv(X_T*X)*X_T*y
    return beta
I am receiving an error: operands could not be broadcast together with shapes (257,0,1389) (1389,0,257) on the line where beta is calculated.
Any help is appreciated!
Thanks!

The problem is that you are working with numpy arrays, not matrices as in MATLAB. Matrices, by default, do matrix mathematical operations. So X*Y does a matrix multiplication of X and Y. With arrays, however, the default is to use element-by-element operations. So X*Y multiplies each corresponding element of X and Y. This is the equivalent of MATLAB's .* operation.
But just like how MATLAB's matrices can do element-by-element operations, Numpy's arrays can do matrix multiplication. So what you need to do is use numpy's matrix multiplication instead of its element-by-element multiplication. For Python 3.5 or higher (which is the version you should be using for this sort of work), that is just the @ operator. So your line becomes:
beta = inv(X_T @ X) @ X_T @ y
Or, better yet, you can use the .T attribute, which does the same thing as numpy.transpose but is much more concise (you can then drop the X_T = numpy.transpose(X) line entirely):
beta = inv(X.T @ X) @ X.T @ y
For Python 3.4 or earlier, you will need to use np.dot, since those versions of Python don't have the @ matrix multiplication operator:
beta = np.dot(np.dot(inv(np.dot(X.T, X)), X.T), y)
Numpy has a matrix object that uses matrix operations by default like the MATLAB matrix. Do not use it! It is slow, poorly-supported, and almost never what you really want. The Python community has standardized around arrays, so use those.
There may also be some issues with the dimensions of traindata. For the indexing above to give 2D results, traindata.ndim would have to be 3: in order for y and X to both come out 2D, traindata itself would need to be 3D.
This could be an issue if traindata is 2D and you want y to be a MATLAB-style "vector" (what MATLAB calls "vectors" aren't really vectors). In numpy, using a single index like traindata[:, 0] reduces the number of dimensions, while taking a slice like traindata[:, :1] doesn't. So to keep y 2D when traindata is 2D, just take a length-1 slice, traindata[:, :1]. This gives exactly the same values but keeps the same number of dimensions as traindata.
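Here is a minimal illustration of that difference, using a small made-up array in place of the real traindata:
import numpy as np

traindata = np.arange(12.0).reshape(4, 3)   # small stand-in for the real data

y_1d = traindata[:, 0]    # single index: one dimension is dropped
y_2d = traindata[:, :1]   # length-1 slice: dimensions are preserved

print(y_1d.shape)                            # (4,)
print(y_2d.shape)                            # (4, 1)
print(np.array_equal(y_1d, y_2d.ravel()))    # True -- same values, different shape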
Notes: Your code can be significantly simplified using logical indexing:
def linear_regression_train(traindata):
    y = traindata[:, 0]  # This is the output
    y[y == 2] = -1
    y[y == 3] = 1
    X = traindata[:, 1:257]
    return inv(X.T @ X) @ X.T @ y
Also, your slice is wrong when defining X. Python slicing excludes the last value, so to get 256 columns you need to do 1:257, as I did above.
Finally, please keep in mind that modifications to arrays inside functions carry over outside the functions, and indexing does not make a copy. So your changes to y (setting some values to 1 and others to -1) will affect traindata outside of your function. If you want to avoid that, you need to make a copy before you make your changes:
y = traindata[:, 0].copy()
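A small demonstration of that aliasing behavior, again with a made-up stand-in for traindata:
import numpy as np

traindata = np.arange(12.0).reshape(4, 3)

y = traindata[:, 0]         # a view, not a copy
y[0] = -1
print(traindata[0, 0])      # -1.0 -- the original array was modified

traindata = np.arange(12.0).reshape(4, 3)
y = traindata[:, 0].copy()  # an independent copy
y[0] = -1
print(traindata[0, 0])      # 0.0 -- the original array is untouched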

Related

apply function to unique values of NumPy array

I have a function f which I would like to apply to all elements of an arbitrarily-shaped and -ordered NumPy array x. Because the function evaluation is expensive and x may contain duplicate values, I first reduce x to unique values, a one-dimensional array xu.
xu, ind = np.unique(x, return_inverse=True)
I then create an array for the function values
yu = np.full(len(xu), np.nan)
and fill in this array by applying f elementwise.
I would now like to create an array y of the same shape as x, so that corresponding entries contain the result of the function. My attempt:
y = np.full(x.shape, np.nan)
y[ind] = yu
This fails if x isn't already one-dimensional. (You may guess that I'm used to Matlab, where linear indexing of a multidimensional array works.) What I would need for this is a one-dimensional view on y which I can apply [ind] = to, to assign to the correct elements.
Question 1: Is there such a one-dimensional view on a multidimensional array?
Alternatively, I could create y as one-dimensional, assign values, and then reshape.
y = np.full(x.size, np.nan)
y[ind] = yu
y = np.reshape(y, x.shape)
This seems to work, but I'm unsure whether I have to account for the storage order of x.
Question 2: Does ind returned by np.unique always follow 'C' order, which is default for np.reshape, or does it depend on the internal structure of x?
The indices returned by np.unique operate on a raveled array. This is documented under the first parameter:
Unless axis is specified, this will be flattened if it is not already 1-D.
Ravelling/flattening always happens in C order, regardless of the memory layout. flatten is just ravel with a guaranteed copy; ravel itself only creates a copy when your array is not laid out in C order:
>>> x = np.zeros((3, 3), order='F')
>>> x.ravel().base is x
False
>>> y = np.zeros((3, 3))
>>> y.ravel().base is y
True
x.ravel() is equivalent to x.reshape(-1). That means that you can unravel the result with something like flat_y.reshape(original_x_shape):
xu, ind = np.unique(x, return_inverse=True)
yu = np.zeros_like(xu)
for i in range(len(xu)):
    yu[i] = fn(xu[i])
y_flat = yu[ind]
y = y_flat.reshape(x.shape)
Since you are reshaping a contiguous buffer, y and y_flat share the same memory:
>>> y.base is y_flat
True
Fancy indexing, as in the expression y_flat = yu[ind], will always make a copy, since in the general case you can't tell whether the selected data would be contiguous or not.
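For instance (a quick check, not from the original answer):
import numpy as np

yu = np.array([10.0, 20.0, 30.0])
ind = np.array([0, 2, 1, 0])

y_flat = yu[ind]     # fancy indexing allocates a new array
y_flat[0] = -1.0
print(yu)            # [10. 20. 30.] -- yu is unaffected, so y_flat must be a copy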
Part of the reason that linear indexing always works in MATLAB is that it guarantees contiguous arrays, always stored in column-major order. Numpy maintains a separate stride for each dimension, so it supports non-contiguous arrays. That allows numpy to do things like transpose an array, or take simple slices from it, without making a copy of the underlying data.
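A short sketch of how strides make that possible (assuming 8-byte floats):
import numpy as np

x = np.zeros((3, 4))
print(x.strides)      # (32, 8) -- bytes to step along each dimension

xt = x.T              # transposing just swaps the shape and strides
print(xt.strides)     # (8, 32)
print(xt.base is x)   # True -- no data was copied

s = x[::2, 1:3]       # a simple slice is also just a strided view
print(s.base is x)    # True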
On a side note, if you want to avoid explicitly calling reshape on y, you can call it on ind instead:
xu, ind = np.unique(x, return_inverse=True)
yu = np.zeros_like(xu)
for i in range(len(xu)):
    yu[i] = fn(xu[i])
y = yu[ind.reshape(x.shape)]

Einsum slower than explicit Numpy implementation for n-mode tensor-matrix product

I'm trying to implement the n-mode tensor-matrix product (as defined by Kolda and Bader: https://www.sandia.gov/~tgkolda/pubs/pubfiles/SAND2007-6702.pdf) efficiently in Python using Numpy. The operation effectively gets down to (for matrix U, tensor X and axis/mode k):
Extract all vectors along axis k from X by collapsing all other axes.
Multiply these vectors on the left by U using standard matrix multiplication.
Insert the vectors again into the output tensor using the same shape, apart from X.shape[k], which is now equal to U.shape[0] (initially, X.shape[k] must be equal to U.shape[1], as a result of the matrix multiplication).
I've been using an explicit implementation for a while which performs all these steps separately:
Transpose the tensor to bring axis k to the front (in my full code I added an exception in case k == X.ndim - 1, in which case it's faster to leave it there and transpose all future operations, or at least in my application, but that's not relevant here).
Reshape the tensor to collapse all other axes.
Calculate the matrix multiplication.
Reshape the tensor to reconstruct all other axes.
Transpose the tensor back into the original order.
I would think this implementation creates a lot of unnecessary (big) arrays, so once I discovered np.einsum I thought this would speed things up considerably. However using the code below I got worse results:
import numpy as np
from time import time

def mode_k_product(U, X, mode):
    transposition_order = list(range(X.ndim))
    transposition_order[mode] = 0
    transposition_order[0] = mode
    Y = np.transpose(X, transposition_order)
    transposed_ranks = list(Y.shape)
    Y = np.reshape(Y, (Y.shape[0], -1))
    Y = U @ Y
    transposed_ranks[0] = Y.shape[0]
    Y = np.reshape(Y, transposed_ranks)
    Y = np.transpose(Y, transposition_order)
    return Y

def einsum_product(U, X, mode):
    axes1 = list(range(X.ndim))
    axes1[mode] = X.ndim + 1
    axes2 = list(range(X.ndim))
    axes2[mode] = X.ndim
    return np.einsum(U, [X.ndim, X.ndim + 1], X, axes1, axes2, optimize=True)

def test_correctness():
    A = np.random.rand(3, 4, 5)
    for i in range(3):
        B = np.random.rand(6, A.shape[i])
        X = mode_k_product(B, A, i)
        Y = einsum_product(B, A, i)
        print(np.allclose(X, Y))

def test_time(method, amount):
    U = np.random.rand(256, 512)
    X = np.random.rand(512, 512, 256)
    start = time()
    for i in range(amount):
        method(U, X, 1)
    return (time() - start)/amount

def test_times():
    print("Explicit:", test_time(mode_k_product, 10))
    print("Einsum:", test_time(einsum_product, 10))

test_correctness()
test_times()
Timings for me:
Explicit: 3.9450525522232054
Einsum: 15.873924326896667
Is this normal or am I doing something wrong? I know there are circumstances where storing intermediate results can decrease complexity (e.g. chained matrix multiplication), however in this case I can't think of any calculations that are being repeated. Is matrix multiplication so optimized that it removes the benefits of not transposing (which technically has a lower complexity)?
I'm more familiar with the subscripts style of using einsum, so I worked out these equivalences:
In [194]: np.allclose(np.einsum('ij,jkl->ikl',B0,A), einsum_product(B0,A,0))
Out[194]: True
In [195]: np.allclose(np.einsum('ij,kjl->kil',B1,A), einsum_product(B1,A,1))
Out[195]: True
In [196]: np.allclose(np.einsum('ij,klj->kli',B2,A), einsum_product(B2,A,2))
Out[196]: True
With a mode parameter, your approach in einsum_product may be best. But the equivalences help me visualize the calculation better, and may help others.
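For completeness, here is a small sketch of my own (not part of the original code) that builds the same subscripts string for an arbitrary mode; the helper name is made up, and it assumes X.ndim <= 8 so that 'i' and 'j' stay free:
import numpy as np

def einsum_subscripts_product(U, X, mode):
    # e.g. for a 3-D tensor and mode=1 this builds 'ij,ajc->aic'
    letters = 'abcdefgh'
    in_sub = list(letters[:X.ndim])
    out_sub = list(in_sub)
    in_sub[mode] = 'j'
    out_sub[mode] = 'i'
    subscripts = 'ij,' + ''.join(in_sub) + '->' + ''.join(out_sub)
    return np.einsum(subscripts, U, X, optimize=True)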
Timings should basically be the same. There's an extra setup time in einsum_product that should disappear in larger dimensions.
After updating Numpy, Einsum is only slightly slower than the explicit method, with or without multi-threading (see comments to my question).

Does np.dot automatically transpose vectors?

I am trying to calculate the first and second order moments for a portfolio of stocks (i.e. expected return and standard deviation).
expected_returns_annual
Out[54]:
ticker
adj_close CNP 0.091859
F -0.007358
GE 0.095399
TSLA 0.204873
WMT -0.000943
dtype: float64
type(expected_returns_annual)
Out[55]: pandas.core.series.Series
weights = np.random.random(num_assets)
weights /= np.sum(weights)
returns = np.dot(expected_returns_annual, weights)
So normally the expected return is calculated by
(x1,...,xn' * (R1,...,Rn)
with x1,...,xn are weights with a constraint that all the weights have to sum up to 1 and ' means that the vector is transposed.
Now I am wondering a bit about the numpy dot function, because
returns = np.dot(expected_returns_annual, weights)
and
returns = np.dot(expected_returns_annual, weights.T)
give the same results.
I tested also the shape of weights.T and weights.
weights.shape
Out[58]: (5,)
weights.T.shape
Out[59]: (5,)
The shape of weights.T should be (,5) and not (5,), but numpy displays them as equal (I also tried np.transpose, but there is the same result)
Does anybody know why numpy behaves this way? In my opinion, np.dot automatically shapes the vectors the right way so that the vector product works well. Is that correct?
Best regards
Tom
The semantics of np.dot are not great
As Dominique Paul points out, np.dot has very heterogenous behavior depending on the shapes of the inputs. Adding to the confusion, as the OP points out in his question, given that weights is a 1D array, np.array_equal(weights, weights.T) is True (array_equal tests for equality of both value and shape).
Recommendation: use np.matmul or the equivalent @ instead
If you are someone just starting out with Numpy, my advice to you would be to ditch np.dot completely. Don't use it in your code at all. Instead, use np.matmul, or the equivalent operator @. The behavior of @ is more predictable than that of np.dot, while still being convenient to use. For example, you would get the same dot product for the two 1D arrays you have in your code like so:
returns = expected_returns_annual @ weights
You can prove to yourself that this gives the same answer as np.dot with this assert:
assert expected_returns_annual @ weights == expected_returns_annual.dot(weights)
Conceptually, @ handles this case by promoting the two 1D arrays to appropriate 2D arrays (though the implementation doesn't necessarily do this). For example, if you have x with shape (N,) and y with shape (M,), if you do x @ y the shapes will be promoted such that:
x.shape == (1, N)
y.shape == (M, 1)
Complete behavior of matmul/@
Here's what the docs have to say about matmul/@ and the shapes of inputs/outputs (a quick demonstration of each case follows the list):
If both arguments are 2-D they are multiplied like conventional matrices.
If either argument is N-D, N > 2, it is treated as a stack of matrices residing in the last two indexes and broadcast accordingly.
If the first argument is 1-D, it is promoted to a matrix by prepending a 1 to its dimensions. After matrix multiplication the prepended 1 is removed.
If the second argument is 1-D, it is promoted to a matrix by appending a 1 to its dimensions. After matrix multiplication the appended 1 is removed.
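For example (my own check of those rules, not part of the quoted docs):
import numpy as np

A = np.arange(12).reshape(3, 4)   # 2-D
x = np.arange(3)                  # 1-D, length 3
y = np.arange(4)                  # 1-D, length 4

print((x @ A).shape)   # (4,) -- x promoted to (1, 3), then the leading 1 is dropped
print((A @ y).shape)   # (3,) -- y promoted to (4, 1), then the trailing 1 is dropped
print(x @ x)           # a 0-d result: the inner product of two 1-D arrays

S = np.random.rand(2, 3, 4)       # stack of two 3x4 matrices
T = np.random.rand(2, 4, 5)       # stack of two 4x5 matrices
print((S @ T).shape)   # (2, 3, 5) -- matrices multiplied pairwise along the first axis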
Notes: the arguments for using @ over dot
As hpaulj points out in the comments, np.array_equal(x.dot(y), x @ y) for all x and y that are 1D or 2D arrays. So why do I (and why should you) prefer @? I think the best argument for using @ is that it helps to improve your code in small but significant ways:
@ is explicitly a matrix multiplication operator. x @ y will raise an error if y is a scalar, whereas dot will make the assumption that you actually just wanted elementwise multiplication. This can potentially result in a hard-to-localize bug in which dot silently returns a garbage result (I've personally run into that one). Thus, @ allows you to be explicit about your own intent for the behavior of a line of code (see the short example after this list).
Because @ is an operator, it has some nice short syntax for coercing various sequence types into arrays, without having to explicitly cast them. For example, [0,1,2] @ np.arange(3) is valid syntax.
To be fair, while [0,1,2].dot(arr) is obviously not valid, np.dot([0,1,2], arr) is valid (though more verbose than using @).
When you do need to extend your code to deal with many matrix multiplications instead of just one, the ND cases for @ are a conceptually straightforward generalization/vectorization of the lower-D cases.
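Here's a quick illustration of that first point (my own example; the exact error message depends on the numpy version):
import numpy as np

x = np.arange(3)

print(np.dot(x, 2))        # [0 2 4] -- dot silently falls back to scalar/elementwise multiplication

try:
    x @ 2
except (ValueError, TypeError) as exc:
    print(type(exc).__name__)   # @ refuses a scalar operand outright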
I had the same question some time ago. It seems that when one of your matrices is one dimensional, then numpy will figure out automatically what you are trying to do.
The documentation for the dot function has a more specific explanation of the logic applied:
If both a and b are 1-D arrays, it is inner product of vectors (without complex conjugation).
If both a and b are 2-D arrays, it is matrix multiplication, but using matmul or a @ b is preferred.
If either a or b is 0-D (scalar), it is equivalent to multiply and using numpy.multiply(a, b) or a * b is preferred.
If a is an N-D array and b is a 1-D array, it is a sum product over the last axis of a and b.
If a is an N-D array and b is an M-D array (where M>=2), it is a sum product over the last axis of a and the second-to-last axis of b.
In NumPy, a transpose .T reverses the order of dimensions, which means that it doesn't do anything to your one-dimensional array weights.
This is a common source of confusion for people coming from Matlab, in which one-dimensional arrays do not exist. See Transposing a NumPy Array for some earlier discussion of this.
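A small illustration (not part of the original answer), using a 1D weights array like the one in the question:
import numpy as np

weights = np.random.random(5)
print(weights.shape)                   # (5,)
print(weights.T.shape)                 # (5,) -- reversing a single axis changes nothing

# To get a genuine row or column, add an axis explicitly:
print(weights[np.newaxis, :].shape)    # (1, 5) -- a row
print(weights[:, np.newaxis].shape)    # (5, 1) -- a column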
np.dot(x,y) has complicated behavior on higher-dimensional arrays, but its behavior when it's fed two one-dimensional arrays is very simple: it takes the inner product. If we wanted to get the equivalent result as a matrix product of a row and column instead, we'd have to write something like
np.asscalar(x @ y[:, np.newaxis])
adding a trailing dimension to y to turn it into a "column", multiplying, and then converting our one-element array back into a scalar. But np.dot(x,y) is much faster and more efficient, so we just use that.
Edit: actually, this was dumb on my part. You can, of course, just write the matrix multiplication x @ y to get equivalent behavior to np.dot for one-dimensional arrays, as tel's excellent answer points out.
The shape of weights.T should be (,5) and not (5,),
suggests some confusion over the shape attribute. shape is an ordinary Python tuple, i.e. just a set of numbers, one for each dimension of the array. That's analogous to the size of a MATLAB matrix.
(5,) is just the way of displaying a 1-element tuple. The , is required because Python has historically also used () for simple grouping.
In [22]: tuple([5])
Out[22]: (5,)
Thus the , in (5,) does not have a special numpy meaning, and
In [23]: (,5)
File "<ipython-input-23-08574acbf5a7>", line 1
(,5)
^
SyntaxError: invalid syntax
A key difference between numpy and MATLAB is that arrays can have any number of dimensions (up to 32). MATLAB has a lower bound of 2.
The result is that a 5-element numpy array can have shapes (5,), (1,5), (5,1), (1,5,1), etc.
The handling of a 1d weight array in your example is best explained by the np.dot documentation. Describing it as an inner product seems clear enough to me. But I'm also happy with the
sum product over the last axis of a and the second-to-last axis of b
description, adjusted for the case where b has only one axis.
(5,) with (5,n) => (n,) # 5 is the common dimension
(n,5) with (5,) => (n,)
(n,5) with (5,1) => (n,1)
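Those rules are easy to verify directly (a quick sketch with n = 3):
import numpy as np

a = np.random.rand(5)          # (5,)
B = np.random.rand(5, 3)       # (5, n)
C = np.random.rand(3, 5)       # (n, 5)
d = np.random.rand(5, 1)       # (5, 1)

print(np.dot(a, B).shape)      # (3,)   -- (5,) with (5, n) => (n,)
print(np.dot(C, a).shape)      # (3,)   -- (n, 5) with (5,) => (n,)
print(np.dot(C, d).shape)      # (3, 1) -- (n, 5) with (5, 1) => (n, 1)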
In:
(x1,...,xn' * (R1,...,Rn)
are you missing a )?
(x1,...,xn)' * (R1,...,Rn)
And does the * mean matrix product? Not elementwise product (.* in MATLAB)? (R1,...,Rn) would have size (n,1), and (x1,...,xn)' size (1,n). The product would then be (1,1).
By the way, that raises another difference. MATLAB expands dimensions to the right (n,1,1...). numpy expands them to the left (1,1,n) (if needed by broadcasting). The initial dimensions are the outermost ones. That's not as critical a difference as the lower size 2 boundary, but shouldn't be ignored.
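A quick way to see numpy's left-expansion in action (my own example, not from the answer above):
import numpy as np

a = np.ones((2, 3, 4))
b = np.ones(4)               # shape (4,)
print((a + b).shape)         # (2, 3, 4) -- b is treated as (1, 1, 4); new axes go on the left

c = np.ones((3, 1))
print((a + c).shape)         # (2, 3, 4) -- (3, 1) broadcasts against the trailing (3, 4)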

Preserving dimensions when slicing symbolic block matrices in sympy

I am using sympy (python 3.6, sympy 1.0) to facilitate the calculation of matrix-transformations in mathematical proofs.
To calculate the Schur complements it is necessary to slice a block-matrix consisting of symbolic matrices.
As directly addressing the matrix with:
M[0:1,1]
is not working, I tried sympy.matrices.expressions.blockmatrix.blocks. Unfortunately, blocks messes up the dimensions of the matrices when addressing a range of blocks:
from sympy import *
n = Symbol('n')
Aj = MatrixSymbol('Aj', n,n)
M = BlockMatrix([[Aj, Aj],[Aj, Aj]])
M1 = M.blocks[0:1,0:1]
M2 = M.blocks[0,0]
print(M1.shape)
print(M2.shape)
M.blocks returns a matrix with dimensions (1, 1) for the matrix M1, while the matrix M2 has the right dimensions (n, n).
Any suggestion on how to get the right dimensions when slicing a range of blocks?
The method blocks returns an ImmutableMatrix object, not a BlockMatrix object. Here it is for reference:
def blocks(self):
    from sympy.matrices.immutable import ImmutableMatrix
    mats = self.args
    data = [[mats[i] if i == j else ZeroMatrix(mats[i].rows, mats[j].cols)
             for j in range(len(mats))]
            for i in range(len(mats))]
    return ImmutableMatrix(data)
The shape of an ImmutableMatrix object is determined by the number of symbols it contains; the structure of symbols is not taken into account. Hence, you get (1,1).
When using M.blocks[0,0] you access an element of the matrix, which is Aj. This is known as a MatrixSymbol, so the shape works as expected.
When using M.blocks[0:1, 0:1] you slice a SymPy matrix. Slicing always returns a submatrix, even if the size of the slice is 1 by 1. So you get an ImmutableMatrix with one entry, Matrix([[Aj]]). As said above, the shape of this thing is (1,1) since there is no recognition of the block structure.
As user2357112 suggested, converting the sliced output of blocks into a BlockMatrix causes the shape to be determined on the basis of the shape of Aj:
>>> M3 = BlockMatrix(M.blocks[0:, 0:1])
>>> M3.shape
(2*n, n)
It's often useful to check the type of objects that behave in an unexpected way: e.g., type(M1) and type(M2).
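For example (the exact class names may vary between SymPy versions):
print(type(M1))   # an ImmutableMatrix subclass -- a 1x1 matrix whose single entry is Aj
print(type(M2))   # MatrixSymbol -- the n x n block itself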

Matlab to Python numpy indexing and multiplication issue

I have the following line of code in MATLAB which I am trying to convert to Python numpy:
pred = traindata(:,2:257)*beta;
In Python, I have:
pred = traindata[ : , 1:257]*beta
beta is a 256 x 1 array.
In MATLAB,
size(pred) = 1389 x 1
But in Python,
pred.shape = (1389L, 256L)
So, I found out that multiplying by the beta array is producing the difference between the two arrays.
How do I write the original Python line, so that the size of pred is 1389 x 1, like it is in MATLAB when I multiply by my beta array?
I suspect that beta is in fact a 1D numpy array. In numpy, 1D arrays are not row or column vectors, whereas MATLAB clearly makes this distinction. They are simply 1D arrays, agnostic of any orientation. If you must, you need to manually introduce a new singleton dimension to the beta vector to facilitate the multiplication. On top of this, the * operator actually performs element-wise multiplication. To perform matrix-vector or matrix-matrix multiplication, you must use numpy's dot function.
Therefore, you must do something like this:
import numpy as np # Just in case
pred = np.dot(traindata[:, 1:257], beta[:,None])
beta[:,None] will create a 2D numpy array where the elements from the 1D array are populated along the rows, effectively making a column vector (i.e. 256 x 1). However, if you have already done this on beta, then you don't need to introduce the new singleton dimension. Just use dot normally:
pred = np.dot(traindata[:, 1:257], beta)
