I have velocity data (mf) for a fluid at 5 axial locations (x) for 14 different combinations of two parameters of the fluid (Re, k). The velocity data is dependent on Re, k and x.
I would like to use sklearn to do polynomial regression of my data as in this post, but I am facing some problems:
How should I build the X matrix (the matrix of the independent variables)? It seems to me that there are 3 independent variables here (Re, k and x) but I have 14 values of Re, 14 values of k and only 5 values for x.
Would it be possible to regress with degree=1 w.r.t. Re and k and degree=3 w.r.t. x?
Any help is appreciated. Thanks!
If you have three 2D array-like objects Re, k, and x, you can create polynomial features of degree 3 on x alone by applying the PolynomialFeatures transformer to just x before stacking everything into a single feature matrix:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

poly_x = PolynomialFeatures(3)
X = np.hstack([Re, k, poly_x.fit_transform(x)])
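For completeness, here is a minimal sketch of the whole pipeline under my assumptions about your layout: the 14 (Re, k) combinations and 5 axial locations are flattened into 70 samples, the data below is random placeholder data, and a plain LinearRegression is fitted at the end.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Hypothetical data: 14 (Re, k) combinations, velocities at 5 axial locations each.
Re_vals = np.random.rand(14)
k_vals = np.random.rand(14)
x_vals = np.linspace(0.0, 1.0, 5)
mf = np.random.rand(14, 5)                # one row of velocities per (Re, k) pair

# Flatten the 14 x 5 grid into 70 samples, one column per independent variable.
Re = np.repeat(Re_vals, 5).reshape(-1, 1)
k = np.repeat(k_vals, 5).reshape(-1, 1)
x = np.tile(x_vals, 14).reshape(-1, 1)
y = mf.ravel()

# Degree 1 in Re and k, degree 3 in x: only x goes through PolynomialFeatures.
poly_x = PolynomialFeatures(3, include_bias=False)   # bias dropped; LinearRegression adds the intercept
X = np.hstack([Re, k, poly_x.fit_transform(x)])      # shape (70, 5): Re, k, x, x**2, x**3

reg = LinearRegression().fit(X, y)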
I need help performing polynomial features on 3 dimensional data and performing linear regression to create a line of best fit on the 3 dimensional polynomial.
I have a random dataframe with x, y, and z as the columns that forms a polynomial scatterplot.
X and Y are similar values while z is vastly different.
Example:
X=(-3,9,-20,-8,-14)
Y=(-2,8,-19,-8,-13)
Z=(-960,110,4867,-149,1493)
I have done this for 2 dimensional data but not 3d.
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X.reshape(-1, 1))
X_poly[0]
However, how do I handle the data when I have x, y, and z? Do I need to run poly.fit_transform on both x and y?
Next I did linear regression
from sklearn.linear_model import LinearRegression
LinReg = LinearRegression()
LinReg.fit(X_poly,z)
Then when I create test data for x and y and perform the predict method on z, the resulting line is linear instead of a polynomial.
Any help would be much appreciated.
I finally figured it out. I needed to pass a DataFrame containing only x and y through PolynomialFeatures and then use XY_poly and z in linreg.fit(). This trains the model for my next steps to create the line of best fit for the polynomial.
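For reference, here is a minimal sketch of that workflow using the example values from the question; the DataFrame layout, the degree=2 choice, and the test points are my own assumptions.

import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({'x': [-3, 9, -20, -8, -14],
                   'y': [-2, 8, -19, -8, -13],
                   'z': [-960, 110, 4867, -149, 1493]})

# Only the features (x, y) go through PolynomialFeatures; z stays the target.
poly = PolynomialFeatures(degree=2, include_bias=False)
XY_poly = poly.fit_transform(df[['x', 'y']])

linreg = LinearRegression()
linreg.fit(XY_poly, df['z'])

# When predicting, transform the test points with the same poly object;
# otherwise the predicted curve will come out linear instead of polynomial.
test_points = np.array([[0.0, 1.0], [5.0, 4.0]])
z_pred = linreg.predict(poly.transform(test_points))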
When the PolynomialFeatures documentation says that the fit_transform() method takes X and y, X is an (n_samples, n_features) array of features and y is the array of target values, which is optional. In your case I would do the following:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = (-3, 9, -20, -8, -14)
Y = (-2, 8, -19, -8, -13)
Z = (-960, 110, 4867, -149, 1493)
foo = np.array([X, Y, Z])
foo = foo.transpose()  # transpose the array to bring it to shape (n, 3)
poly = PolynomialFeatures(3)
poly.fit_transform(foo)
Once this is done, you can use the output of fit_transform(foo).
I want to draw categorical vectors whose prior is a product of Dirichlet distributions. The categories are fixed, and each element in the categorical vector corresponds to a different Dirichlet prior. Here is a categorical vector of length 33 with 4 categories, set up with a Dirichlet prior.
import numpy as np
import pymc3 as pm

with pm.Model() as model3:
    theta = pm.Dirichlet(name='theta', a=np.ones((33, 4)), shape=(33, 4))
    seq = [pm.Categorical(name='seq_{}'.format(i), p=theta[i, :], shape=(1,)) for i in range(33)]
    step1 = pm.Metropolis(vars=[theta])
    step2 = [pm.CategoricalGibbsMetropolis(vars=[s]) for s in seq]
    trace = pm.sample(50, step=[step1] + step2)
However, this approach is cumbersome, as I have to do some array indexing to get the categorical vectors out. Are there better ways of doing this?
You don't need to specify the shape. Note that the way you've set it up, there are 33 different categorical variables; I'm assuming that's what you intended. Here's an easier way to do that:
with pm.Model() as model:
    theta = pm.Dirichlet(name='theta', a=np.ones(4))
    children = [pm.Categorical(f"seq_{i}", p=theta) for i in range(33)]
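If you do want a separate Dirichlet prior for each of the 33 positions, as in your original model, the list comprehension can often be avoided by giving the prior a (33, 4) shape and letting Categorical broadcast over the rows. This is only a sketch: it assumes your PyMC3 version accepts a batched (row-wise) p for Categorical; if it does not, the per-element list from the question still works.

import numpy as np
import pymc3 as pm

with pm.Model() as model_vec:
    # One Dirichlet prior per position: each row of theta is a
    # probability vector over the 4 categories.
    theta = pm.Dirichlet('theta', a=np.ones((33, 4)), shape=(33, 4))
    # Assumes Categorical broadcasts over the rows of p, giving a
    # length-33 vector of category indices (no list indexing needed).
    seq = pm.Categorical('seq', p=theta, shape=33)
    trace = pm.sample(50)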
I know it is possible to fit several sequences into hmmlearn but it seems to me that these sequences need to be drawn from the same distributions.
Is it possible to fit a GMHMM with several observations sequences drawn from different distributions in hmmlearn?
My use case :
I would like to fit a GMHMM with K financial time series from different stocks and predict the market regime that generated the K stock prices at a specified time.
So the matrix input has dimension N (number of dates) × K (number of stocks).
If hmmlearn can't do that, please tell me whether it is possible with another package in Python or R.
Thanks for your help!
My approach to your problem would be to use a multivariate Gaussian for the emission probabilities.
For example, let's assume that K is 2, i.e., the number of stocks is 2.
In hmmlearn, K is encoded in the dimensions of the means matrix.
See the "Sampling from HMM" example: it has a 2-dimensional output. In other words, X.shape = (N, K), where N is the length of the sample (500 in that case) and K is the dimension of the output, which is 2.
Notice that the authors plot each dimension on its own axis: the x-axis shows the first dimension X[:, 0] and the y-axis the second dimension X[:, 1].
To train your model, make sure that X1 and X2 have the same shape as the sampled X in the example, and form the training dataset as described here.
In summary, adapt the example to your case by setting K to your number of stocks instead of K=2, and use GMMHMM instead of GaussianHMM.
# Another example
import numpy as np
from hmmlearn import hmm

model = hmm.GaussianHMM(n_components=5, covariance_type="diag", n_iter=100)
K = 3  # number of stocks (features per observation)
model.n_features = K  # declare the observation dimension (fit() also infers this from the data)

# Create a random training sequence (only 1 sequence) of length 100.
X1 = np.random.randn(100, K)  # 100 observations, each with K features
model.fit(X1)

# Sample from the fitted model
X, Z = model.sample(200)
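For the mixture emissions the question asks about, the same shape conventions apply. Here is a minimal sketch using hmmlearn's GMMHMM class; the random training matrix is a placeholder for your N x K matrix of stock data, and the choice of 4 regimes and 2 mixture components is arbitrary.

import numpy as np
from hmmlearn import hmm

K = 3    # number of stocks (features per observation)
N = 500  # number of dates

# GMMHMM: each hidden state (market regime) emits from a Gaussian mixture.
gmm_model = hmm.GMMHMM(n_components=4, n_mix=2, covariance_type="diag", n_iter=100)

# Placeholder training data of shape (N, K); replace with your price/return matrix.
X_train = np.random.randn(N, K)
gmm_model.fit(X_train)

# Most likely regime at each date.
regimes = gmm_model.predict(X_train)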
In a book I have found the following code which fits a LinearRegression to quadratic data:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 0.5 * X**2 + X + 2 + np.random.randn(m, 1)
poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)
lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)
But how could that be? I know from the documentation that PolynomialFeatures(degree=2, include_bias=False) is creating an array which looks like:
[[X[0],X[0]**2]
[X[1],X[1]**2]
.....
[X[n],X[n]**2]]
BUT: how is LinearRegression able to fit this data? That is, WHAT is LinearRegression doing, and what is the concept behind it?
I am grateful for any explanations!
PolynomialFeatures with degree two (and the default include_bias=True) will create an array that looks like:
[[1, X[0], X[0]**2]
 [1, X[1], X[1]**2]
 .....
 [1, X[n], X[n]**2]]
(With include_bias=False, as in your code, the column of ones is dropped and LinearRegression supplies the intercept itself.)
Let's call the matrix above X. Then LinearRegression is looking for 3 numbers a, b, c such that the vector
X * [[a], [b], [c]] - Y
has the smallest possible mean squared error (which is just the mean of the squared entries of that vector).
Note that the product X * [[a], [b], [c]] is simply the product of the matrix X with the column vector [a, b, c].T. The result is a vector of the same dimension as Y.
Regarding the questions in your comment:
This function is linear in the new set of features: x, x**2. Just think about x**2 as an additional feature in your model.
For the particular array mentioned in your question, the LinearRegression method is looking for numbers a,b,c that minimize the sum
(a*1 + b*X[0] + c*X[0]**2 - Y[0])**2 + (a*1 + b*X[1] + c*X[1]**2 - Y[1])**2 + ... + (a*1 + b*X[n] + c*X[n]**2 - Y[n])**2
So it will find a set of such numbers a,b,c. Hence the suggested function y=a+b*x+c*x**2 is not based only on the first row. Instead, it is based on all the rows, because the parameters a,b,c that are chosen are those that minimize the sum above, and this sum involves elements from all the rows.
Once you have created the vector x**2, linear regression just regards it as an additional feature. You can give it a new name, v = x**2. Then the linear regression has the form y = a + b*x + c*v, which means it is linear in x and v. The algorithm does not care how you created v; it just treats v as an additional feature.
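To see this in practice, here is a small sketch that reuses poly_features, lin_reg, X and y from the book snippet above; the grid x_new and the plotting calls are my own additions. Because the new inputs go through the same transform before predicting, the fitted "line" comes out as a parabola.

import numpy as np
import matplotlib.pyplot as plt

# Evaluate the fitted model on a grid of new x values.
x_new = np.linspace(-3, 3, 100).reshape(-1, 1)
x_new_poly = poly_features.transform(x_new)   # columns: [x, x**2]
y_new = lin_reg.predict(x_new_poly)           # a + b*x + c*x**2

plt.scatter(X, y, s=10)
plt.plot(x_new, y_new, color="red")
plt.show()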
I have an X array of shape (40, 100) and a Y array containing 40 elements.
Is it possible to do OLS or WLS? How do I do that?
After the scatter plot, how do I apply least squares to find the relationship between X and Y? For example, I would like to generate the equation relating X and Y.
Here is a simple example.
X=[[0.0,0.03,0.04,0.0,0.1,0.1,0.7,0.5,0.3,0.6],
[0.0,0.0,0.4,0.5,0.1,0.1,0.03,0.04,0.0,0.1],
[0.6,0.7,0.0,0.8,0.1,0.1,0.1,0.1,0.7,0.5],
[0.3,0.6,0.1,0.5,0.6,0.1,0.4,0.5,0.1,0.1]]
Y=[1,4,2,5]
Whether OLS or WLS is appropriate is a separate question (for example, linear dependence among the features requires a different approach, and if your response variable Y is discrete you would use logistic regression or something else instead of OLS), but performing it in Python with your data looks like this:
import numpy as np
import numpy.linalg as la
X = np.array([[0.0,0.03,0.04,0.0,0.1,0.1,0.7,0.5,0.3,0.6],
[0.0,0.0,0.4,0.5,0.1,0.1,0.03,0.04,0.0,0.1],
[0.6,0.7,0.0,0.8,0.1,0.1,0.1,0.1,0.7,0.5],
[0.3,0.6,0.1,0.5,0.6,0.1,0.4,0.5,0.1,0.1]])
Y = np.array([1,4,2,5])
OLS = la.lstsq(X, Y, rcond=None)[0]
print(OLS)
[-0.60940892 0.19707325 3.94166269 4.06073677 2.76949291
0.90614714 0.92161768 1.5417828 -1.87887552 -0.63917305]
Note that because there are more features (10) than observations (4), the system is underdetermined, so lstsq returns a minimum-norm solution that fits the data exactly:
np.allclose(X.dot(OLS),Y)
True
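For the WLS part of the question, one option is statsmodels; below is a minimal sketch using the same toy data, where the per-observation weights w are hypothetical placeholders. With more features than observations the design is rank-deficient, so treat this purely as an API illustration.

import numpy as np
import statsmodels.api as sm

X = np.array([[0.0,0.03,0.04,0.0,0.1,0.1,0.7,0.5,0.3,0.6],
              [0.0,0.0,0.4,0.5,0.1,0.1,0.03,0.04,0.0,0.1],
              [0.6,0.7,0.0,0.8,0.1,0.1,0.1,0.1,0.7,0.5],
              [0.3,0.6,0.1,0.5,0.6,0.1,0.4,0.5,0.1,0.1]])
Y = np.array([1, 4, 2, 5])
w = np.array([1.0, 0.5, 2.0, 1.0])  # hypothetical per-observation weights

ols_res = sm.OLS(Y, X).fit()             # ordinary least squares
wls_res = sm.WLS(Y, X, weights=w).fit()  # weighted least squares
print(ols_res.params)
print(wls_res.params)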