Derivative of patsy dmatrix with respect to a specific variable - python

Edit: I now have a candidate solution to my question (see toy example below) -- if you can think of something more robust, please let me know.
I just found out about python's patsy package for creating design matrices from R-style formulas, and it looks great. My question is this: given a patsy formula, e.g. "(x1 + x2 + x3)**2", is there an easy way to create a design matrix containing the derivative with respect to a particular variable, e.g. "x1"?
Here's a toy example:
import numpy as np
import pandas as pd
import patsy
import sympy
import sympy.parsing.sympy_parser as sympy_parser
n_obs = 200
df = pd.DataFrame(np.random.uniform(size=(n_obs, 3)), columns=["x1", "x2", "x3"])
df.describe()
design_matrix = patsy.dmatrix("(I(7*x1) + x2 + x3)**2 + I(x1**2) + I(x1*x2*x3)", df)
design_matrix.design_info.column_names
## ['Intercept', 'I(7 * x1)', 'x2', 'x3', 'I(7 * x1):x2', 'I(7 * x1):x3', 'x2:x3', 'I(x1 ** 2)', 'I(x1 * x2 * x3)']
x1, x2, x3 = sympy.symbols("x1 x2 x3")
def diff_wrt_x1(string):
    return str(sympy.diff(sympy_parser.parse_expr(string), x1))
colnames_to_differentiate = [colname.replace(":", "*").replace("Intercept", "1").replace("I", "")
                             for colname in design_matrix.design_info.column_names]
derivatives_wrt_x1 = [diff_wrt_x1(colname) for colname in colnames_to_differentiate]
def get_column(string):
    try:
        return float(string) * np.ones((len(df), 1))    # For cases like string == "7"
    except ValueError:
        return patsy.dmatrix("0 + I(%s)" % string, df)  # For cases like string == "x2*x3"
derivative_columns = tuple(get_column(derivative_string) for derivative_string in derivatives_wrt_x1)
design_matrix_derivative = np.hstack(derivative_columns)
design_matrix_derivative[0] # Contains [0, 7, 0, 0, 7*x2, 7*x3, 0, 2*x1, x2*x3]
design_matrix_derivative_manual = np.zeros_like(design_matrix_derivative)
design_matrix_derivative_manual[:, 1] = 7.0
design_matrix_derivative_manual[:, 4] = 7*df["x2"]
design_matrix_derivative_manual[:, 5] = 7*df["x3"]
design_matrix_derivative_manual[:, 7] = 2*df["x1"]
design_matrix_derivative_manual[:, 8] = df["x2"] * df["x3"]
np.all(np.isclose(design_matrix_derivative, design_matrix_derivative_manual)) # True!
The code generates a design matrix with columns [1, 7*x1, x2, x3, 7*x1*x2, 7*x1*x3, x2*x3, x1^2, x1*x2*x3].
Suppose I want a new formula which differentiates design_matrix with respect to x1. The desired result is a matrix of the same shape as design_matrix, but whose columns are [0, 7, 0, 0, 7*x2, 7*x3, 0, 2*x1, x2*x3]. Is there a programmatic way to do that? I've tried searching the patsy docs as well as stackoverflow and I don't see anything. Of course I can create the derivative matrix manually, but it would be great to have a function that does it (and that doesn't have to be updated when I change the formula to, say, "(x1 + x2 + x3 + x4)**2 + I(x1**3)").
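For reference, the toy example above can be wrapped into a reusable function. This is only a sketch of the candidate solution (the name dmatrix_derivative is made up, and the string manipulation inherits the toy example's fragility, e.g. stripping "I" will mangle variable names that contain an "I"):
import numpy as np
import patsy
import sympy
import sympy.parsing.sympy_parser as sympy_parser
def dmatrix_derivative(formula, data, wrt):
    # Differentiate each column of a patsy design matrix with respect to `wrt`.
    design_matrix = patsy.dmatrix(formula, data)
    sym = sympy.Symbol(wrt)
    columns = []
    for name in design_matrix.design_info.column_names:
        # Turn the patsy column name into a sympy-parsable expression string.
        expr_str = name.replace(":", "*").replace("Intercept", "1").replace("I", "")
        deriv = str(sympy.diff(sympy_parser.parse_expr(expr_str), sym))
        try:
            col = float(deriv) * np.ones((len(data), 1))    # constant derivative, e.g. "7"
        except ValueError:
            col = patsy.dmatrix("0 + I(%s)" % deriv, data)  # expression, e.g. "x2*x3"
        columns.append(np.asarray(col))
    return np.hstack(columns)
derivative = dmatrix_derivative("(x1 + x2 + x3)**2", df, "x1")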

Related

Why am I getting an incorrect result from multiplying an inverted matrix by a vector?

I'm trying to learn Python for basic work in linear algebra. I'm running into the following problem with a simple system of linear equations:
import scipy.linalg as la
import numpy as np
A = np.array([[186/450, 54/21, 30/60],
              [12/450,  6/21,   3/60],
              [ 9/450,  6/21,  15/60]])
l = np.array([18/450, 12/21, 30/36])
b = np.array([2, 0, 1/6])
y = np.array([180, 0, 30])
x = la.inv(np.eye(3) - A) @ y
lam = np.transpose(l) @ la.inv(np.eye(3) - A)
This returns
array([0.21212121, 2.12121212, 1.39393939])
which is incorrect. Performing the same operation in Julia,
A = [186/450 54/21 30/60;
     12/450  6/21   3/60;
      9/450  6/21  15/60]
l = [18/450, 12/21, 30/60]
b = [2, 0, 1/6]
y = [180, 0, 30]
λ = l' * inv(I - A)
yields the correct result, which is
1×3 adjoint(::Vector{Float64}) with eltype Float64:
0.181818 1.81818 0.909091
What am I missing here? I think I might be missing something in the opaque numpy array syntax.
There is a typo in the instantiation of l in your Python code: 30/36 should be 30/60.
With the typo fixed, this code produces the same result as in Julia:
import scipy.linalg as la
import numpy as np
A = np.array([[186/450, 54/21, 30/60],
              [12/450,  6/21,   3/60],
              [ 9/450,  6/21,  15/60]])
l = np.array([18/450, 12/21, 30/60])  # typo fixed here
b = np.array([2, 0, 1/6])
y = np.array([180, 0, 30])
x = la.inv(np.eye(3) - A) @ y
lam = np.transpose(l) @ la.inv(np.eye(3) - A)
Giving:
array([0.18181818, 1.81818182, 0.90909091])
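As a side note, explicitly inverting a matrix is generally less accurate (and slower) than solving the corresponding linear system. A minimal sketch of the same computation via np.linalg.solve, using the fact that lam' = l' (I - A)^(-1) is equivalent to (I - A)' lam = l:
import numpy as np
A = np.array([[186/450, 54/21, 30/60],
              [12/450,  6/21,   3/60],
              [ 9/450,  6/21,  15/60]])
l = np.array([18/450, 12/21, 30/60])
# Solve (I - A)' lam = l instead of forming the inverse explicitly.
lam = np.linalg.solve((np.eye(3) - A).T, l)
print(lam)  # [0.18181818 1.81818182 0.90909091]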

Find the point of intersection of two linear equations using Numpy

The objective is to find the point of intersection of two linear equations. These two linear equations are derived using the NumPy polyfit function.
Given two time series (xLeft, yLeft) and (xRight, yRight), the linear least-squares fit to each of them was calculated using polyfit as shown below:
xLeft = [
6168, 6169, 6170, 6171, 6172, 6173, 6174, 6175, 6176, 6177,
6178, 6179, 6180, 6181, 6182, 6183, 6184, 6185, 6186, 6187
]
yLeft = [
0.98288751, 1.3639959, 1.7550986, 2.1539073, 2.5580614,
2.9651523, 3.3727503, 3.7784295, 4.1797948, 4.5745049,
4.9602985, 5.3350167, 5.6966233, 6.0432272, 6.3730989,
6.6846867, 6.9766307, 7.2477727, 7.4971657, 7.7240791
]
xRight = [
6210, 6211, 6212, 6213, 6214, 6215, 6216, 6217, 6218, 6219,
6220, 6221, 6222, 6223, 6224, 6225, 6226, 6227, 6228, 6229,
6230, 6231, 6232, 6233, 6234, 6235, 6236, 6237, 6238, 6239,
6240, 6241, 6242, 6243, 6244, 6245, 6246, 6247, 6248, 6249,
6250, 6251, 6252, 6253, 6254, 6255, 6256, 6257, 6258, 6259,
6260, 6261, 6262, 6263, 6264, 6265, 6266, 6267, 6268, 6269,
6270, 6271, 6272, 6273, 6274, 6275, 6276, 6277, 6278, 6279,
6280, 6281, 6282, 6283, 6284, 6285, 6286, 6287, 6288]
yRight=[
7.8625913, 7.7713094, 7.6833806, 7.5997391, 7.5211883,
7.4483986, 7.3819046, 7.3221073, 7.2692747, 7.223547,
7.1849418, 7.1533613, 7.1286001, 7.1103559, 7.0982385,
7.0917811, 7.0904517, 7.0936642, 7.100791, 7.1111741,
7.124136, 7.1389918, 7.1550579, 7.1716633, 7.1881566,
7.2039142, 7.218349, 7.2309117, 7.2410989, 7.248455,
7.2525721, 7.2530937, 7.249711, 7.2421637, 7.2302341,
7.213747, 7.1925621, 7.1665707, 7.1356878, 7.0998487,
7.0590014, 7.0131001, 6.9621005, 6.9059525, 6.8445964,
6.7779589, 6.7059474, 6.6284504, 6.5453324, 6.4564347,
6.3615761, 6.2605534, 6.1531439, 6.0391097, 5.9182019,
5.7901659, 5.6547484, 5.5117044, 5.360805, 5.2018456,
5.034656, 4.8591075, 4.6751242, 4.4826899, 4.281858,
4.0727611, 3.8556159, 3.6307325, 3.3985188, 3.1594861,
2.9142516, 2.6635408, 2.4081881, 2.1491354, 1.8874279,
1.6242117, 1.3607255, 1.0982931, 0.83831298
]
left_line = np.polyfit(xLeft, yLeft, 1)
right_line = np.polyfit(xRight, yRight, 1)
In this case, polyfit outputs the coefficients m and b for y = mx + b, respectively.
The intersection of the two linear equations then can be calculated as follows:
x0 = -(left_line[1] - right_line[1]) / (left_line[0] - right_line[0])
y0 = x0 * left_line[0] + left_line[1]
However, I wonder whether there exists a NumPy built-in approach to calculate the last two steps?
Not exactly a built-in approach, but you can simplify the problem. Say I have lines given by y = m1 * x + b1 and y = m2 * x + b2. You can trivially find an equation for their difference, which is also a line:
y = (m1 - m2) * x + (b1 - b2)
Notice that this line will have a root at the intersection of the two original lines, if they intersect. You can use the numpy.polynomial.Polynomial class to perform these operations:
>>> (np.polynomial.Polynomial(left_line[::-1]) - np.polynomial.Polynomial(right_line[::-1])).roots()
array([6192.0710885])
Notice that I had to swap the order of the coefficients, since Polynomial expects smallest to largest, while np.polyfit returns the opposite. In fact, np.polyfit is not recommended. Instead, you can get Polynomial objects directly using the np.polynomial.Polynomial.fit class method. Your code would then look like:
left_line = np.polynomial.Polynomial.fit(xLeft, yLeft, 1, domain=[-1, 1])
right_line = np.polynomial.Polynomial.fit(xRight, yRight, 1, domain=[-1, 1])
x0 = (left_line - right_line).roots()
y0 = left_line(x0)
The domain is mapped to the window [-1, 1]. If you do not specify a domain, the peak-to-peak of the x-values will be used instead; you do not want this, since it would rescale the input values. Instead, we explicitly specify that the domain [-1, 1] maps to the same window. An alternative would be to use the default domain and set e.g. window=[xLeft.min(), xLeft.max()]. The problem with that approach is that it would create different domains for the two polynomials, preventing the operation left_line - right_line.
See https://numpy.org/doc/stable/reference/routines.polynomials.classes.html for more information.
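As a quick self-contained sanity check of this recipe (on made-up lines, not the question's data): the lines y = 2x + 1 and y = -x + 4 should cross at (1, 3).
import numpy as np
xL = np.array([0.0, 1.0, 2.0]); yL = 2 * xL + 1  # y = 2x + 1
xR = np.array([0.0, 1.0, 2.0]); yR = -xR + 4     # y = -x + 4
left = np.polynomial.Polynomial.fit(xL, yL, 1, domain=[-1, 1])
right = np.polynomial.Polynomial.fit(xR, yR, 1, domain=[-1, 1])
x0 = (left - right).roots()
print(x0, left(x0))  # [1.] [3.]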
You can model it as a linear system and use simple linear algebra:
def get_intersection(m1, b1, m2, b2):
    # Solve the linear system A X = b, where X = [x, y]'
    A = np.array([[-m1, 1], [-m2, 1]])
    b = np.array([[b1], [b2]])
    X = np.linalg.pinv(A) @ b
    x, y = np.round(np.squeeze(X), 4)
    return x, y  # point of intersection (x, y) with 4-decimal precision
m1, b1, m2, b2 = left_line[0], left_line[1], right_line[0], right_line[1]  # slope/intercept from the question's np.polyfit output
print(get_intersection(m1,b1,m2,b2))
As an example, for the lines y - x = 1 and y + x = 1, we expect the intersection at (0, 1):
m1,b1,m2,b2 = 1, 1, -1, 1
print(get_intersection(m1,b1,m2,b2))
Output: (0.0, 1.0) as expected.

Sympy get vector from vector field

I'm using sympy (which is awesome) and I just made a vector field like this
> import sympy
> from sympy.vector import CoordSys3D
> from sympy import *
> R = CoordSys3D('R')
> x, y, z, t = symbols('x y z t')
> v = x*R.i + 4*z*R.j + y*R.k
x*R.i + 4*z*R.j + y*R.k
> v.evalf(subs={x:6, y:5, z:2})
6.00000000000000*R.i + 8.00000000000000*R.j + 5.00000000000000*R.k
and what I need is to get a vector or list of the form [6.0, 8.0, 5.0]. Is there a way to get a list from v.evalf()? I could use split or something on "6.00000000000000*R.i + 8.00000000000000*R.j + 5.00000000000000*R.k", but that seems ugly, and maybe there's a built-in method for that?
In [252]: vector = v.evalf(subs={x:6, y:5, z:2}); vector
Out[252]: 6.00000000000000*R.i + 8.00000000000000*R.j + 5.00000000000000*R.k
In [253]: list(vector.to_matrix(R))
Out[253]: [6.00000000000000, 8.00000000000000, 5.00000000000000]
Other possibilities include
In [256]: vector.as_poly().coeffs()
Out[256]: [6.00000000000000, 8.00000000000000, 5.00000000000000]
In [257]: list(vector.components.values())
Out[257]: [5.00000000000000, 8.00000000000000, 6.00000000000000]
but I think they suffer a fatal flaw which is exposed when one or more of the components equal 0. For example, if z is set to 0:
In [258]: vector = v.evalf(subs={x:6, y:5, z:0}); vector
Out[258]: 6.00000000000000*R.i + 5.00000000000000*R.k
Then list(vector.to_matrix(R)) still returns 3 components:
In [259]: list(vector.to_matrix(R))
Out[259]: [6.00000000000000, 0, 5.00000000000000]
while these other two expressions omit the zero-component:
In [260]: vector.as_poly().coeffs()
Out[260]: [6.00000000000000, 5.00000000000000]
In [261]: list(vector.components.values())
Out[261]: [5.00000000000000, 6.00000000000000]

Find the position of a lowest difference between numpy arrays

I've got two musical files: one lossless with a small sound gap at the beginning (at this time it's just silence, but it could be anything: a sinusoid or just some noise) and one mp3:
In [1]: plt.plot(y[:100000])   # plot of the lossless signal (figure omitted)
In [2]: plt.plot(y2[:100000])  # plot of the mp3 signal (figure omitted)
These lists are similar but not identical, so I need to cut this gap, i.e. find the first occurrence of one list in the other with the lowest delta error.
And here's my solution (5.7065 sec.):
error = []
for i in range(25000):
    y_n = y[i:100000]
    y2_n = y2[:100000-i]
    error.append(abs(y_n - y2_n).mean())
start = np.array(error).argmin()
print(start, error[start])  # 23057 0.0100046
Is there any pythonic way to solve this?
Edit:
After calculating the mean distance between special points (e.g. where data == 0.5), I reduced the search area from 25000 to 2000. This gives a reasonable time of 0.3871 s:
a = np.where(y[:100000].round(1) == 0.5)[0]
b = np.where(y2[:100000].round(1) == 0.5)[0]
mean = int((a - b[:len(a)]).mean())
delta = 1000
error = []
for i in range(mean - delta, mean + delta):
    ...
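For completeness, a sketch of what the elided loop presumably looks like, assuming its body mirrors the brute-force version above (the offset bookkeeping at the end is my assumption):
error = []
for i in range(mean - delta, mean + delta):
    y_n = y[i:100000]
    y2_n = y2[:100000 - i]
    error.append(abs(y_n - y2_n).mean())
# error[0] corresponds to i == mean - delta, so shift the argmin back:
start = (mean - delta) + int(np.argmin(error))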
What you are trying to do is a cross-correlation of the two signals.
This can be done easily using signal.correlate from the scipy library:
import scipy.signal
import numpy as np
# limit your signal length to speed things up
lim = 25000
# do the actual correlation
corr = scipy.signal.correlate(y[:lim], y2[:lim], mode='full')
# The offset is the maximum of your correlation array,
# itself being offset by (lim - 1):
offset = np.argmax(corr) - (lim - 1)
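A minimal self-contained check of this recipe on synthetic data (the signal and gap length below are made up, just to show that the argmax of the correlation recovers the gap):
import numpy as np
import scipy.signal
rng = np.random.default_rng(0)
sig = rng.standard_normal(2000)
gap = 230
y = np.concatenate([np.zeros(gap), sig])         # "lossless": silence, then the signal
y2 = sig + 0.01 * rng.standard_normal(len(sig))  # "mp3": the signal plus a little noise
lim = 2000
corr = scipy.signal.correlate(y[:lim], y2[:lim], mode='full')
offset = np.argmax(corr) - (lim - 1)
print(offset)  # 230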
You might want to take a look at this answer to a similar problem.
Let's generate some data first
N = 1000
y1 = np.random.randn(N)
y2 = y1 + np.random.randn(N) * 0.05
y2[0:int(N / 10)] = 0
In these data, y1 and y2 are almost the same (note the small added noise), but the first 10% of y2 are empty (similar to your example).
We can now calculate the absolute difference between the two vectors and find the first element for which the absolute difference is below a sensitivity threshold:
abs_delta = np.abs(y1 - y2)
THRESHOLD = 1e-2
sel = abs_delta < THRESHOLD
ix_start = np.where(sel)[0][0]
fig, axes = plt.subplots(3, 1)
ax = axes[0]
ax.plot(y1, '-')
ax.set_title('y1')
ax.axvline(ix_start, color='red')
ax = axes[1]
ax.plot(y2, '-')
ax.axvline(ix_start, color='red')
ax.set_title('y2')
ax = axes[2]
ax.plot(abs_delta)
ax.axvline(ix_start, color='red')
ax.set_title('abs diff')
This method works if the overlapping parts are indeed "almost identical". You will have to think of smarter alignment ways if the similarity is low.
I think what you are looking for is correlation. Here is a small example.
import numpy as np
equal_part = [0, 1, 2, 3, -2, -4, 5, 0]
y1 = equal_part + [0, 1, 2, 3, -2, -4, 5, 0]
y2 = [1, 2, 4, -3, -2, -1, 3, 2]+y1
np.argmax(np.correlate(y1, y2, 'same'))
Out:
7
So this returns the time shift at which the correlation between the two signals is at its maximum. As you can see, in this example the time difference should be 8, but this depends on your data...
Also note that both signals should have the same length.

how to write symbol for sum over a variable's subscript in sympy

I want to write a sympy symbol for a summation, but the index summed over also appears as the subscript of a variable name in the summand. For example,
import numpy as np
import sympy
sympy.init_printing()
r = sympy.Symbol('r')
a = sympy.Matrix(sympy.symbols('a:4'))
rpowers = sympy.Matrix([r**i for i in range(len(a))])
long_expr = a.dot(rpowers)
n = sympy.Symbol('n')
a_n = sympy.Symbol('a_n')
short_expr = sympy.Sum(a_n * r**n, (n, 0, 3))
long_expr and short_expr denote the same thing mathematically. But with long_expr, I can substitute in the values for the a's and then lambdify that expression into a numpy function:
coeffed_long_expr = long_expr.subs(zip(a, [-1, 3, 23, 8]))
func_long_expr = sympy.lambdify([r], coeffed_long_expr, 'numpy')
How can I do the same with short_expr? Or is short_expr only useful for displaying the expression with a summation sign in this case? I would like to be able to display it using the summation sign, especially for large n.
You can accomplish this by using sympy.Function:
import sympy
a_seq = [-1, 3, 23, 8]
n, r = sympy.symbols('n, r')
a_n = sympy.Function('a')(n)
terms = 4
short_expr = sympy.Sum(a_n * r**n, (n, 0, terms - 1))
coeffed_short_expr = short_expr.doit().subs(
    (a_n.subs(n, i), a_seq[i]) for i in range(terms))  # 8*r**3 + 23*r**2 + 3*r - 1
func_short_expr = sympy.lambdify(r, coeffed_short_expr, 'numpy')
If you wish for a cleaner, more efficient implementation, I suspect you may be able to define a subclass of sympy.Symbol that implements subs() properly for summations.
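An alternative sketch that avoids subclassing (my suggestion, not part of the answer above): sympy.IndexedBase gives subscripted symbols that render under the summation sign and substitute cleanly after doit():
import sympy
from sympy import IndexedBase, Sum, symbols
a_seq = [-1, 3, 23, 8]
n, r = symbols('n r')
a = IndexedBase('a')
short_expr = Sum(a[n] * r**n, (n, 0, 3))  # displays with a summation sign
coeffed = short_expr.doit().subs({a[i]: a_seq[i] for i in range(4)})
# coeffed == 8*r**3 + 23*r**2 + 3*r - 1
func = sympy.lambdify(r, coeffed, 'numpy')
func(2.0)  # 161.0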
