Scipy fitting polynomial model to some data - python

I am trying to find an appropriate function for the permeability of cells under varying conditions. If I assume constant permeability, I can fit it to the experimental data and use sklearn's PolynomialFeatures together with a LinearModel (as explained in this post) to determine a correlation between the conditions and the permeability. However, the permeability is not constant, so now I am trying to fit my model with the permeability as a function of the process conditions. The PolynomialFeatures module of sklearn is quite nice to use.
Is there an equivalent function within scipy or numpy that allows me to create a polynomial model (including interaction terms, e.g. a*x[0]*x[1]) of varying order without writing the whole function by hand?
The standard polynomial class in numpy does not seem to support interaction terms.

I'm not aware of a function that does exactly what you need, but you can achieve it using a combination of itertools and numpy.
If you have n_features predictor variables, you essentially need to generate all vectors of length n_features whose entries are non-negative integers and sum to the specified order. Each new feature column is the product of the original features raised component-wise to one such vector of powers.
For example, if order = 3 and n_features = 2, one of the new features will be the old features raised to the respective powers [2,1]. I've written some code below for arbitrary order and number of features, adapting the generation of vectors that sum to order from this post.
import itertools
import numpy as np
from scipy.special import binom
def polynomial_features_with_cross_terms(X, order):
    """
    X: numpy ndarray
        Matrix of shape `(n_samples, n_features)` to be transformed.
    order: integer
        Order of the polynomial features to be computed.

    Returns: T, powers.
    `T` is a matrix of shape `(n_samples, n_poly_features)`.
    Note that `n_poly_features` is equal to
    `n_features+order-1` choose `n_features-1`; see:
    https://en.wikipedia.org/wiki/Stars_and_bars_%28combinatorics%29#Theorem_two
    `powers` is a matrix of shape `(n_features, n_poly_features)`.
    Each column specifies, row by row, the power of the respective
    feature in the corresponding column of `T`.
    """
    n_samples, n_features = X.shape
    n_poly_features = int(binom(n_features + order - 1, n_features - 1))
    powers = np.zeros((n_features, n_poly_features))
    T = np.zeros((n_samples, n_poly_features), dtype=X.dtype)
    combos = itertools.combinations(range(n_features + order - 1), n_features - 1)
    for i, c in enumerate(combos):
        # Stars and bars: the gaps between the "bars" give the exponent vector
        powers[:, i] = np.array([
            b - a - 1 for a, b in zip((-1,) + c, c + (n_features + order - 1,))
        ])
        T[:, i] = np.prod(np.power(X, powers[:, i]), axis=1)
    return T, powers
Here's some example usage:
>>> X = np.arange(-5,5).reshape(5,2)
>>> T, p = polynomial_features_with_cross_terms(X, order=3)
>>> print(X)
[[-5 -4]
 [-3 -2]
 [-1  0]
 [ 1  2]
 [ 3  4]]
>>> print(p)
[[ 0.  1.  2.  3.]
 [ 3.  2.  1.  0.]]
>>> print(T)
[[ -64  -80 -100 -125]
 [  -8  -12  -18  -27]
 [   0    0    0   -1]
 [   8    4    2    1]
 [  64   48   36   27]]
Finally, I should mention that the SVM polynomial kernel achieves exactly this effect without explicitly computing the polynomial map. There are of course pros and cons to this, but I figured I should mention it for you to consider if you haven't yet.
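For instance, with scikit-learn's SVR it looks roughly as follows (a minimal sketch; the data here is made up and the hyperparameters are only placeholders):
import numpy as np
from sklearn.svm import SVR

X = np.random.rand(50, 2)   # hypothetical process conditions
y = np.random.rand(50)      # hypothetical permeability values

# The kernel (x.y + coef0)**degree implicitly spans the monomials up to
# `degree` without ever materializing the polynomial feature matrix.
model = SVR(kernel='poly', degree=3, coef0=1.0)
model.fit(X, y)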

Related

Random matrix with sum of values by column = 1 in python

Create a matrix like a transition matrix
How can I create a random matrix whose values sum to 1 by column in python?
I suggest completing this in two steps:
Create a random matrix
Normalize each column
1. Create a random matrix
Let's say you want a 3 by 3 random transition matrix:
M = np.random.rand(3, 3)
Each of M's entries will have a random value between 0 and 1.
2. Normalize M's columns
Dividing each column by its column sum will achieve what you want. This can be done in several ways, but I prefer to create an array r whose elements are the column sums of M:
r = M.sum(axis=0)
Then, divide M by r:
transition_matrix = M / r
Example output
>>> import numpy as np
>>> M = np.random.rand(3, 3)
>>> r = M.sum(axis=0)
>>> transition_matrix = M / r
>>> M
array([[0.74145687, 0.68389986, 0.37008102],
       [0.81869654, 0.0394523 , 0.94880781],
       [0.93057194, 0.48279246, 0.15581823]])
>>> r
array([2.49072535, 1.20614462, 1.47470706])
>>> transition_matrix
array([[0.29768713, 0.56701315, 0.25095223],
       [0.32869804, 0.03270943, 0.64338731],
       [0.37361483, 0.40027743, 0.10566046]])
>>> transition_matrix.sum(axis=0)
array([1., 1., 1.])
You could use a known distribution whose samples sum to one by construction, e.g. the Dirichlet distribution.
With that, the code is basically a one-liner (Python 3.8, Windows 10 x64):
import numpy as np
N = 3
# set alphas array, 1s by default
a = np.empty(N)
a.fill(1.0)
mtx = np.random.dirichlet(a, N).transpose()
print(mtx)
and it will print something like
[[0.56634637 0.04568052 0.79105779]
 [0.42542107 0.81892862 0.02465906]
 [0.00823256 0.13539087 0.18428315]]
UPDATE
The problem with the "sample something and normalize" approach is that the normalized values follow an unknown distribution. For the Dirichlet distribution there are closed-form expressions for the mean, standard deviation, PDF, CDF, you name it.
Even in the simple case where each Xi is sampled from U(0,1), what is the distribution of Xi/Sum(i, Xi)? Anything to say about its mean? Standard deviation? PDF? Other statistical properties?
You could sample from the exponential distribution and get a sum normalized to 1, but the question would be even more acute: if Xi is Exp(1), what is the distribution of Xi/Sum(i, Xi)? PDF? Mean? Standard deviation?
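As a small empirical illustration of the point (a sketch, not part of the original answer): normalizing uniform samples does not reproduce the flat Dirichlet, whose marginal moments are known exactly.
import numpy as np

rng = np.random.default_rng(0)
N, trials = 3, 100_000

u = rng.random((trials, N))
u_norm = u / u.sum(axis=1, keepdims=True)    # "sample and normalize"
d = rng.dirichlet(np.ones(N), size=trials)   # flat Dirichlet directly

# Dirichlet(1,1,1) marginals are Beta(1,2): mean 1/3, std sqrt(1/18) ~ 0.2357
print(u_norm[:, 0].mean(), u_norm[:, 0].std())
print(d[:, 0].mean(), d[:, 0].std())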

How to apply Cramér's V on a 2x2 matrix

I want to find the association between variables, and Cramér's V works like a treat for matrices larger than 2x2. However, for matrices with low frequencies it does not work well. For the following contingency matrix I get the result 0.5. How can I account for that?
   1  2
a  2  0
b  0  2
Here is my code:
import numpy as np
import scipy.stats as ss

def cramers_stat(confusion_matrix):
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()
    return np.sqrt(chi2 / (n * (min(confusion_matrix.shape) - 1)))

result = cramers_stat(confusion_matrix)
print(result)
confusion_matrix is my input, in this case the matrix mentioned above. I understand that for good results I need cell frequencies above 5, but for perfect association, as in the case above, I expected the result to be 1.
When you compute Cramér's V, you must compute chi2 without the continuity correction. For a 2x2 matrix, chi2_contingency applies the continuity correction by default, so you must tell it not to by passing the argument correction=False:
chi2 = ss.chi2_contingency(confusion_matrix, correction=False)[0]
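With that change, the example matrix gives exactly 1 (a quick sketch of the corrected function, reusing the imports from the question):
import numpy as np
import scipy.stats as ss

def cramers_stat(confusion_matrix):
    chi2 = ss.chi2_contingency(confusion_matrix, correction=False)[0]
    n = confusion_matrix.sum()
    return np.sqrt(chi2 / (n * (min(confusion_matrix.shape) - 1)))

confusion_matrix = np.array([[2, 0], [0, 2]])
print(cramers_stat(confusion_matrix))  # 1.0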

How can I create an n-dimensional grid in numpy to evaluate a function for arbitrary n?

I'm trying to create a naive numerical integration function to illustrate the benefits of Monte Carlo integration in high dimensions. I want something like this:
def quad_int(f, mins, maxs, numPoints=100):
    '''
    Use the naive (Riemann sum) method to numerically integrate f on a box
    defined by the mins and maxs.

    INPUTS:
    f         - A function handle. Should accept a 1-D NumPy array
                as input.
    mins      - A 1-D NumPy array of the minimum bounds on integration.
    maxs      - A 1-D NumPy array of the maximum bounds on integration.
    numPoints - An integer specifying the number of points to sample in
                the Riemann-sum method. Defaults to 100.
    '''
    n = len(mins)
    # Create a grid of evenly spaced points to evaluate f on
    # Evaluate f at each point in the grid; sum all these values up
    dV = np.prod((maxs - mins) / numPoints)
    # Multiply the sum by dV to get the approximate integral
I know my dV is going to cause problems with numerical stability, but right now what I'm having trouble with is creating the domain. If the number of dimensions was fixed, it would be easy enough to just use np.meshgrid like this:
# We don't want the last value since we are using left-hand Riemann sums
x = np.linspace(mins[0],maxs[0],numPoints)[:-1]
y = np.linspace(mins[1],maxs[1],numPoints)[:-1]
z = np.linspace(mins[2],maxs[2],numPoints)[:-1]
X, Y, Z = np.meshgrid(x,y,z)
tot = 0
for index, x in np.ndenumerate(X):
    tot += f(x, Y[index], Z[index])
Is there an analogue to np.meshgrid that can do this for arbitrary dimensions, maybe accept a tuple of arrays? Or is there some other way to do Riemann sums in higher dimensions? I've thought about doing it recursively but can't figure out how that would work.
You could use a list comprehension to generate all of the linspaces, and then pass that list to meshgrid with a * (to convert the list to a tuple of arguments).
XXX = np.meshgrid(*[np.linspace(i,j,numPoints)[:-1] for i,j in zip(mins,maxs)])
XXX is now a list of n arrays (each n dimensional).
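To connect this back to the question, a minimal sketch of quad_int built on that trick could look like this (my stacking of the grid and the dV handling are one possible choice, not the only one):
import numpy as np

def quad_int(f, mins, maxs, numPoints=100):
    # One axis per dimension; drop the last value for left-hand sums
    axes = [np.linspace(lo, hi, numPoints)[:-1] for lo, hi in zip(mins, maxs)]
    grids = np.meshgrid(*axes)
    # Flatten the grid into sample points, one per row
    points = np.stack([g.ravel() for g in grids], axis=-1)
    # linspace spacing is (hi - lo)/(numPoints - 1), hence the cell volume
    dV = np.prod((np.asarray(maxs) - np.asarray(mins)) / (numPoints - 1))
    return sum(f(p) for p in points) * dV

print(quad_int(lambda p: p[0] * p[1], [0, 0], [1, 1]))  # ~0.25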
I'm using straightforward Python list and argument operations.
np.lib.index_tricks has other index and grid generation functions and classes that might be of use. It's worth reading just to see how things can be done.
A neat trick used in various numpy functions when indexing arrays of unknown dimension is to construct a list of the desired indices. It can include slice(None) where you'd normally write :. Then convert it to a tuple and use it.
In [606]: index=[2,3]
In [607]: [slice(None)]+index
Out[607]: [slice(None, None, None), 2, 3]
In [609]: Y[tuple([slice(None)]+index)]
Out[609]: array([ 0. , 0.5, 1. , 1.5])
In [611]: Y[:,2,3]
Out[611]: array([ 0. , 0.5, 1. , 1.5])
A list is used where the elements need to change. In older NumPy versions converting to a tuple wasn't always needed, but newer versions warn about or reject list indices used this way, so the tuple form is the safe one:
index = [slice(None)]*3
index[1] = 0
Y[tuple(index)]  # same as Y[:,0,:]

numpy covariance matrix

Suppose I have two vectors of length 25, and I want to compute their covariance matrix. I try doing this with numpy.cov, but always end up with a 2x2 matrix.
>>> import numpy as np
>>> x=np.random.normal(size=25)
>>> y=np.random.normal(size=25)
>>> np.cov(x,y)
array([[ 0.77568388,  0.15568432],
       [ 0.15568432,  0.73839014]])
Using the rowvar flag doesn't help either - I get exactly the same result.
>>> np.cov(x,y,rowvar=0)
array([[ 0.77568388,  0.15568432],
       [ 0.15568432,  0.73839014]])
How can I get the 25x25 covariance matrix?
You have two vectors, not 25. The machine I'm on doesn't have Python so I can't test this, but try:
z = list(zip(x, y))
np.cov(z)
Of course... what you really want is probably more like:
n = 100          # number of points in each vector
num_vects = 25
vals = []
for _ in range(num_vects):
    vals.append(np.random.normal(size=n))
np.cov(vals)
This takes the covariance (I think/hope) of num_vects vectors of length n.
Try this:
import numpy as np
x=np.random.normal(size=25)
y=np.random.normal(size=25)
z = np.vstack((x, y))
c = np.cov(z.T)
Covariance matrix from sample vectors
To clear up the confusion about what a covariance matrix computed from two N-dimensional vectors means, note that there are two possibilities.
The question you have to ask yourself is whether you consider:
each vector as N realizations/samples of one single variable (for example two 3-dimensional vectors [X1,X2,X3] and [Y1,Y2,Y3], where you have 3 realizations for the variables X and Y respectively)
each vector as 1 realization for N variables (for example two 3-dimensional vectors [X1,Y1,Z1] and [X2,Y2,Z2], where you have 1 realization for the variables X,Y and Z per vector)
Since each entry of a covariance matrix is the covariance between two variables:
in the first case, you have 2 variables, N example values for each, so you end up with a 2x2 matrix where the covariances are computed thanks to N samples per variable
in the second case, you have N variables, 2 samples for each, so you end up with a NxN matrix
About the actual question, using numpy
if you consider that you have 25 variables per vector (the example code uses 3 instead of 25 to keep it short), i.e. one realization of several variables per vector, use rowvar=0:
import numpy
# [X1,Y1,Z1]
X_realization1 = [1,2,3]
# [X2,Y2,Z2]
X_realization2 = [2,1,8]
numpy.cov([X_realization1, X_realization2], rowvar=0)  # rowvar false, each column is a variable
Code returns, considering 3 variables:
array([[ 0.5, -0.5,  2.5],
       [-0.5,  0.5, -2.5],
       [ 2.5, -2.5, 12.5]])
otherwise, if you consider that one vector is 25 samples of one variable, use rowvar=1 (numpy's default):
# [X1,X2,X3]
X = [1,2,3]
# [Y1,Y2,Y3]
Y = [2,1,8]
numpy.cov([X,Y],rowvar=1) # rowvar true (default), each row is a variable
Code returns, considering 2 variables:
array([[ 1.        ,  3.        ],
       [ 3.        , 14.33333333]])
Reading the documentation via
>>> np.cov.__doc__
or looking at Numpy Covariance: Numpy treats each row of the array as a separate variable, so you have two variables and hence you get a 2x2 covariance matrix.
I think the previous post has the right solution; here is the explanation :-)
I suppose what you're looking for is actually a covariance function, which is a function of the time lag. I compute the autocovariance like this:
def autocovariance(Xi, N, k):
    Xs = np.average(Xi)
    aCov = 0.0
    for i in np.arange(0, N - k):
        aCov = (Xi[i + k] - Xs) * (Xi[i] - Xs) + aCov
    return (1. / N) * aCov

autocov[k] = autocovariance(my_vector, N, k)
You should change
np.cov(x, y, rowvar=0)
to
np.cov((x, y), rowvar=0)
What you got (2 by 2) is more useful than 25x25. The covariance of x and y is an off-diagonal entry of the symmetric covariance matrix.
If you insist on a (25 by 25) matrix, which I think is useless, then why don't you write out the definition?
x=np.random.normal(size=25).reshape(25,1) # to make it 2d array.
y=np.random.normal(size=25).reshape(25,1)
cov = np.matmul(x-np.mean(x), (y-np.mean(y)).T) / len(x)
As pointed out above, you only have two vectors, so you'll only get a 2x2 covariance matrix.
IIRC the two main diagonal terms will be sum((x - mean(x))**2) / (n-1), and similarly for y.
The two off-diagonal terms will be sum((x - mean(x)) * (y - mean(y))) / (n-1). n = 25 in this case.
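A quick sketch to check those formulas against np.cov (they match numpy's default ddof=1 normalization):
import numpy as np

x = np.random.normal(size=25)
y = np.random.normal(size=25)
n = len(x)

var_x = np.sum((x - x.mean())**2) / (n - 1)
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

C = np.cov(x, y)
print(np.isclose(C[0, 0], var_x), np.isclose(C[0, 1], cov_xy))  # True True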
According to the documentation, you should expect the variables as column vectors:
If we examine N-dimensional samples, X = [x1, x2, ..., xn]^T
though later it says each row is a variable:
Each row of m represents a variable.
so you need to pass in your matrix transposed:
x=np.random.normal(size=25)
y=np.random.normal(size=25)
X = np.array([x,y])
np.cov(X.T)
and according to Wikipedia (https://en.wikipedia.org/wiki/Covariance_matrix), X is a column vector of variables:
X = [X1, X2, ..., Xn]^T
COV = E[X * X^T] - μx * μx^T    where μx = E[X]
you can implement it yourself:
# each row of X is an observation; each column is a variable
X = X - X.mean(axis=0)
h, w = X.shape
COV = X.T @ X / (h - 1)
I don't think you understand the definition of a covariance matrix. If you need a 25x25 covariance matrix, you need 25 vectors, each with n data points.
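For example (a small sketch with an arbitrary n of 100):
import numpy as np
samples = np.random.normal(size=(25, 100))  # each row is one variable
print(np.cov(samples).shape)                # (25, 25)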

Numpy- weight and sum rows of a matrix

Using Python & Numpy, I would like to:
Consider each row of an (n columns x m rows) matrix as a vector
Weight each row (scalar multiplication on each component of the vector)
Add the rows together to create a final vector (vector addition).
The weights are given in a regular numpy array (one weight per row), so that each row vector of the matrix should be multiplied by its corresponding weight.
Here's what I've got (with test data; the actual matrix is huge), which is perhaps very un-Numpy and un-Pythonic. Can anyone do better? Thanks!
import numpy
# test data
mvec1 = numpy.array([1,2,3])
mvec2 = numpy.array([4,5,6])
start_matrix = numpy.matrix([mvec1,mvec2])
weights = numpy.array([0.5,-1])
#computation
wmatrix = [ weights[n]*start_matrix[n] for n in range(len(weights)) ]
vector_answer = [0,0,0]
for x in wmatrix: vector_answer+=x
Even though a 'technically' correct answer has already been given, I'll add my straightforward answer:
from numpy import array, dot
dot(array([0.5, -1]), array([[1, 2, 3], [4, 5, 6]]))
# array([-3.5, -4. , -4.5])
This one is much more in the spirit of linear algebra (and of the three bulleted requirements at the top of the question).
Update:
And this solution is really fast, not marginally but easily some 10-15x faster than the already proposed one!
It will be more convenient to use a two-dimensional numpy.array than a numpy.matrix in this case.
start_matrix = numpy.array([[1,2,3],[4,5,6]])
weights = numpy.array([0.5,-1])
final_vector = (start_matrix.T * weights).sum(axis=1)
# array([-3.5, -4. , -4.5])
The multiplication operator * does the right thing here due to NumPy's broadcasting rules.
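For what it's worth, a small sketch showing that this broadcasting form, the dot product from the other answer, and an einsum formulation all agree:
import numpy as np

start_matrix = np.array([[1, 2, 3], [4, 5, 6]])
weights = np.array([0.5, -1])

v1 = weights @ start_matrix                       # matrix-vector product
v2 = (start_matrix.T * weights).sum(axis=1)       # broadcasting + sum
v3 = np.einsum('i,ij->j', weights, start_matrix)  # explicit index notation
print(v1, v2, v3)  # all give [-3.5 -4.  -4.5]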
