this is one of those questions that's probably going to be totally obvious once answered, but for now I'm stuck.
I'm trying to re-create an equation from a result dataset and the four parameters that produced it.
The data is in a matrix with the last column being the result.
I saw that numpy.polyfit allows multiple values for y, so I tried...
result=data[:,-1]
variables=data[:,0:-1]
factors=numpy.polyfit(result,variables,2)
The result comes out as:
[[-4.69652251e-01 8.09734523e-01 1.93673361e-02 -1.62700198e+00]
[ 1.42092582e+01 -7.06024402e+00 -9.94583683e-02 1.11882833e+01]
[ 7.44030682e+00 2.08161127e+01 2.65025708e-01 1.14229534e+01]]
I'm assuming the result coefficients are in the form
[[A^2,B^2,C^2,D^2]
[A ,B, C, D]
[const,const,const,const]]
Which is a little puzzling, especially since if I apply the coefficients to the input data I don't seem to be getting anything even close to the result data.
First off, am I even right about the meaning of polyfit's results?
Second, why are there four constants, all different? Am I supposed to add them together, or what?
Is this merely solving A vs result, then B vs result, etc, rather than combined multi-dimensional minimizing of the whole?? (And if so, how could I do that instead?)
Or am I just misguided about what polyfit is doing in the first place?
Polyfit docs tell us that
Several data sets of sample points sharing the same x-coordinates can
be fitted at once by passing in a 2D-array that contains one dataset
per column.
Let us unpack that.
First, consider an example. Say we have 3 points in the plane and want to fit them with a polynomial of degree 1. That means we want a line that minimizes the sum of squared vertical distances to the 3 points.
Say the points are (1, 1), (2, 2), (3, 3). Obviously there is a line that passes through all of them with no error, namely y = x. Writing the line as y = a * x + b, we have a = 1, b = 0.
Good. Now let us give this example to numpy's polyfit:
X = np.array([1, 2, 3])
y = np.array([1, 2, 3])
a, b = np.polyfit(X, y, deg=1)
(a, b)
>>> (0.9999999999999997, 1.2083031466395714e-15)
a * 1000 + b
>>> 999.9999999999997
Nice. Now let us repeat the example with a matrix of y values instead of a single vector. The docs say this amounts to fitting several datasets that share the same X coordinates. Let us check. We take two sets of points: (1, 1), (2, 2), (3, 3), which the line y = x fits, and (1, 2), (2, 4), (3, 6), which the line y = 2x fits (check!).
We transpose the y matrix because polyfit expects one dataset per column.
X = np.array([1, 2, 3])
y = np.array([[1, 2, 3], [2, 4, 6]]).T
coeff = np.polyfit(X, y, deg=1)
coeff
>>> array([[1.00000000e+00, 2.00000000e+00],
[1.20830315e-15, 2.41660629e-15]])
We see that we have a matrix with first row (1, 2) and second row (0, 0), up to floating-point error. So the first column contains the coefficients for the first line, and the second column those for the second line. Let us check:
a, b = coeff[:, 0]
a * 10 + b
>>> 9.999999999999998
a, b = coeff[:, 1]
a * 100 + b
>>> 199.99999999999994
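So you can pass several datasets that share the same X coordinates and get all the fits at once. Each column is fitted independently against X, which is why the output in the question has one set of coefficients, including its own constant, per column; it is not a joint fit over all four parameters. Fitting many columns at once can be useful, for example, for transforming features for a whole batch of data.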
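For the combined fit the question asks about, one option is an ordinary least-squares fit over a design matrix built from all four parameters. The following is only a minimal sketch: it assumes the model is a constant plus linear and squared terms of each parameter with no cross terms, and the function name fit_quadratic_no_cross_terms and the synthetic data are made up for illustration.

```python
import numpy as np

def fit_quadratic_no_cross_terms(variables, result):
    """Least-squares fit of result ~ c0 + sum_i (a_i * v_i + b_i * v_i**2).

    Returns the coefficients ordered as [c0, a_1..a_p, b_1..b_p].
    """
    n_samples = variables.shape[0]
    # One column of ones, then the variables, then their squares
    design = np.hstack([np.ones((n_samples, 1)), variables, variables ** 2])
    coeffs, *_ = np.linalg.lstsq(design, result, rcond=None)
    return coeffs

# Synthetic, noise-free data standing in for data[:, :-1] and data[:, -1]
rng = np.random.default_rng(0)
variables = rng.uniform(-1.0, 1.0, size=(50, 4))
true_coeffs = np.array([1.0, 2.0, -1.0, 0.5, 3.0, 0.1, -0.2, 0.3, 0.4])
design = np.hstack([np.ones((50, 1)), variables, variables ** 2])
result = design @ true_coeffs

fitted = fit_quadratic_no_cross_terms(variables, result)
print(np.allclose(fitted, true_coeffs))
```

Unlike polyfit with a 2D y, this produces a single constant and minimizes the error over all parameters together. If the real equation has cross terms such as A*B, those products would need their own columns in the design matrix.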
I am reading a text (see the K-nearest neighbors example)
which gives this line of code
dist_sq = np.sum((X[:,np.newaxis,:] - X[np.newaxis,:,:]) ** 2, axis=-1)
Here X is a numpy 10x2 array which represents 10 points in the 2D plane.
It was initialized like this:
X = np.random.rand(10, 2)
OK... The text claims this line computes the pairs of squared distances between the points.
I have no idea why this works and if it works. I tried understanding it but I just can't. I personally try to avoid such cryptic code. This is just not human IMHO. The text explains this code in some details but it seems I don't get that explanation either.
Also, axis=-1 adds up to the confusion.
Could someone decrypt this line of code?
Also, what is the point of saying e.g. X[:,np.newaxis,:], X[np.newaxis,:,:]?
Isn't X[:,np.newaxis], X[np.newaxis,:] enough? Isn't it doing the same?!
Also, from combinatorics, the squared distances count should be 10*9/2 or 10*10/2 (if we include equal points which have distance 0), but this dist_sq is a 10x10x2 array. So this also adds up to the confusion?! Why 200 elements?!
You can analyze the parts of the expression one at a time.
Check the shape of X: X.shape is (10, 2). What does X[np.newaxis,:,:] do here?
It adds a new first dimension to X, giving a NumPy array of shape (1, 10, 2). Similarly, X[:,np.newaxis,:] creates a (10, 1, 2) array.
(X[:,np.newaxis,:] - X[np.newaxis,:,:]) ** 2 has shape (10, 10, 2).
What about dist_sq = np.sum((X[:,np.newaxis,:] - X[np.newaxis,:,:]) ** 2, axis=-1)? It sums over the last axis (the two coordinates), which gives the squared Euclidean distance between each pair of points in X, so dist_sq has shape (10, 10), not (10, 10, 2).
for example:
Y =
array([[0.79410882, 0.38156374],
[0.93574123, 0.6510161 ]])
The result of (Y[:,np.newaxis,:] - Y[np.newaxis,:,:]) ** 2 has shape (2, 2, 2), and np.sum then sums over one specific dimension, the last one (axis=-1).
dist_sq = np.sum((Y[:,np.newaxis,:] - Y[np.newaxis,:,:]) ** 2, axis=-1)
dist_sq=
array([[0. , 0.09266431],
[0.09266431, 0. ]])
For example:
(0.79410882-0.93574123)**2 + (0.38156374-0.6510161)**2 = 0.09266431387197768
So the final result is a symmetric square matrix with zeros on the diagonal.
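To make the shapes concrete, here is a small sketch that rebuilds the expression step by step and checks it against explicit loops. The seed and the intermediate variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((10, 2))              # 10 points in the plane

a = X[:, np.newaxis, :]              # shape (10, 1, 2)
b = X[np.newaxis, :, :]              # shape (1, 10, 2)
diff = a - b                         # broadcasts to shape (10, 10, 2)
dist_sq = np.sum(diff ** 2, axis=-1) # sum over the coordinates -> (10, 10)

# Reference result with explicit loops, for comparison only
loop = np.empty((10, 10))
for i in range(10):
    for j in range(10):
        loop[i, j] = np.sum((X[i] - X[j]) ** 2)

print(dist_sq.shape, np.allclose(dist_sq, loop))

# The shorter spellings from the question give the same shapes,
# because missing trailing indices are treated as full slices.
print(X[:, np.newaxis].shape, X[np.newaxis, :].shape)
```

The matrix holds every ordered pair (i, j), which is why it has 100 entries rather than 10*9/2: each distance appears twice, and the diagonal holds the zero distance of each point to itself.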
I'm working on an optimization problem, but to avoid getting into the details, I'm going to provide a simple example of a bug that's been giving me headaches for a few days.
Say I have a 2D numpy array with observed x-y coordinates:
import numpy as np
from scipy.spatial import distance
x = np.array([[1,2], [2,3], [4,5], [5,6]])
I also have a list of x-y coordinates to compare to these points (y):
y = np.array([[11,13], [12, 14]])
I have a function that takes the sum of manhattan differences between a value of x and all of the values in y:
def find_sum(ref_row, comp_rows):
    modeled_counts = []
    y = ref_row * len(comp_rows)
    res = list(map(distance.cityblock, ref_row, comp_rows))
    modeled_counts.append(sum(res))
    return sum(modeled_counts)
Essentially, what I would like to do is find the sum of manhattan distances for every item in y with each item in x (so basically for each item in x, find the sum of the Manhattan distances between that (x,y) pair and every (x,y) pair in y).
I've tried this out with the following line of code:
z = list(map(find_sum, x, y))
However, z is of length 2 (like y), and not 4 like x. Is there a way to ensure that z is the result of consecutive one-to-all calculations? That is, I'd like to calculate the sum of all of the manhattan differences between x[0] and every set in y, and so on and so forth, so the length of z should be equal to the length of x.
Is there a simple way to do this without a for loop? My data is rather large (~ 4 million rows), so I'd really appreciate fast solutions. I'm fairly new to Python programming, so any explanations about why the solution works and is fast would be appreciated as well, but definitely isn't required!
Thanks!
This solution implements the distance in numpy, as I think it is a good example of broadcasting, which is a very useful thing to know if you need to use arrays and matrices.
By the definition of Manhattan distance, you need the sum of the absolute values of the differences in each coordinate (column). However, the first column of x, x[:, 0], has shape (4,) and the first column of y, y[:, 0], has shape (2,), so they cannot be subtracted directly: the broadcasting rule compares shapes starting from the trailing dimensions, and two dimensions are compatible when they are equal or one of them is 1. Sadly, neither is true for your columns.
However, you can add a new dimension of value 1 using np.newaxis, so
x[:, 0]
is array([1, 2, 4, 5]), but
x[:, 0, np.newaxis]
is
array([[1],
[2],
[4],
[5]])
and its shape is (4, 1). Now, an array of shape (4, 1) minus an array of shape (2,) results in an array of shape (4, 2), by NumPy's broadcasting rules:
4 x 1
2
= 4 x 2
You can obtain the differences for each column:
first_column_difference = x[:, 0, np.newaxis] - y[:, 0]
second_column_difference = x[:, 1, np.newaxis] - y[:, 1]
and evaluate the sum of their absolute values:
np.abs(first_column_difference) + np.abs(second_column_difference)
which results in a (4, 2) matrix. Now, you want to sum the values for each row, so that you have 4 values:
np.sum(np.abs(first_column_difference) + np.abs(second_column_difference), axis=1)
which results in array([73, 69, 61, 57]) for the y used in the complete listing below (note that its second row is [12, 43], not [12, 14] as in the question). The rule is simple: the axis parameter eliminates that dimension from the result, so using axis=1 on a (4, 2) matrix produces 4 values, while axis=0 would produce 2 values.
So, this will solve your problem:
x = np.array([[1, 2], [2, 3], [4, 5], [5, 6]])
y = np.array([[11, 13], [12, 43]])
first_column_difference = x[:, 0, np.newaxis] - y[:, 0]
second_column_difference = x[:, 1, np.newaxis] - y[:, 1]
z = np.abs(first_column_difference) + np.abs(second_column_difference)
print(np.sum(z, axis=1))
You can also skip the intermediate steps for each column and evaluate everything at once (it is a little bit harder to understand, so I prefer the method described above to explain what is happening):
print(np.abs(x[:, np.newaxis] - y).sum(axis=(1, 2)))
This is the general case for an n-dimensional Manhattan distance: if x is (u, n) and y is (v, n), broadcasting (u, 1, n) against (v, n) gives a (u, v, n) array of differences, and summing over the second and third axes leaves u values, one per row of x.
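As a quick check of the general form, the sketch below wraps the one-liner in a function and compares it with explicit loops. The function name manhattan_sums is made up for illustration, and the data is the same as in the listing above.

```python
import numpy as np

def manhattan_sums(x, y):
    """For each row of x, sum the Manhattan distances to every row of y."""
    # (u, 1, n) - (v, n) broadcasts to (u, v, n); summing axes 1 and 2 leaves (u,)
    return np.abs(x[:, np.newaxis, :] - y).sum(axis=(1, 2))

x = np.array([[1, 2], [2, 3], [4, 5], [5, 6]])
y = np.array([[11, 13], [12, 43]])

z = manhattan_sums(x, y)

# Reference computed with explicit loops, for comparison only
z_loop = np.array([sum(np.abs(row - other).sum() for other in y) for row in x])

print(z)
print(np.array_equal(z, z_loop))
```

One caveat: the intermediate array has u * v * n elements, so with millions of rows in x and a large y it may not fit in memory. Processing x in chunks is one way to keep that bounded.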
Here is how you can do it using NumPy broadcasting, with a simplified explanation.
Adjust Shape For Broadcasting
import numpy as np
start_points = np.array([[1,2], [2,3], [4,5], [5,6]])
dest_points = np.array([[11,13], [12, 14]])
## using np.newaxis as an index adds a new dimension at that position
## : selects all the elements along that dimension
start_points = start_points[np.newaxis, :, :]
dest_points = dest_points[:, np.newaxis, :]
## Now let's check the shapes of the point arrays
print('start_points.shape: ', start_points.shape) # (1, 4, 2)
print('dest_points.shape', dest_points.shape) # (2, 1, 2)
Let's try to understand:
The last element of the shape represents the x and y of a point, so it has size 2.
We can think of start_points as having 1 row and 4 columns of points.
We can think of dest_points as having 2 rows and 1 column of points.
In other words, start_points and dest_points are tables of points of size (1x4) and (2x1).
The sizes clearly do not match. What happens if we perform an arithmetic
operation between them? This is where a smart part of NumPy comes in, called broadcasting.
It repeats the row of start_points to match the rows of dest_points, making a (2x4) table.
It repeats the column of dest_points to match the columns of start_points, making a (2x4) table.
The result is the arithmetic operation between every pair of elements of start_points and dest_points.
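The repetition described above can be made explicit with np.broadcast_to, which returns the expanded view that NumPy uses implicitly. This is only an illustrative sketch; the names start_full and dest_full are made up here.

```python
import numpy as np

start_points = np.array([[1, 2], [2, 3], [4, 5], [5, 6]])[np.newaxis, :, :]  # (1, 4, 2)
dest_points = np.array([[11, 13], [12, 14]])[:, np.newaxis, :]               # (2, 1, 2)

# Expand both arrays to the common (2, 4, 2) shape by repeating along the size-1 axes
start_full = np.broadcast_to(start_points, (2, 4, 2))
dest_full = np.broadcast_to(dest_points, (2, 4, 2))

# Subtracting the expanded arrays matches the implicitly broadcast subtraction
same = np.array_equal(start_points - dest_points, start_full - dest_full)
print(start_full.shape, dest_full.shape, same)
```

In practice there is no need to call np.broadcast_to yourself; the arithmetic operators do the equivalent expansion without copying the data.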
Calculate the distance
diff_x_y = start_points - dest_points
print(diff_x_y.shape) # (2, 4, 2)
abs_diff_x_y = np.abs(start_points - dest_points)
man_distance = np.sum(abs_diff_x_y, axis=2)
print('man_distance:\n', man_distance)
sum_distance = np.sum(man_distance, axis=0)
print('sum_distance:\n', sum_distance)
Oneliner
start_points = np.array([[1,2], [2,3], [4,5], [5,6]])
dest_points = np.array([[11,13], [12, 14]])
np.sum(np.abs(start_points[np.newaxis, :, :] - dest_points[:, np.newaxis, :]), axis=(0,2))
Here is a more detailed explanation of broadcasting, if you want to understand it better.
With so many rows you can make substantial savings by using a smart algorithm. Let us for simplicity assume there is just one dimension; once we have established the algorithm, getting back to the general case is a simple matter of summing over coordinates.
The naive algorithm is O(mn) where m,n are the sizes of sets X,Y. Our algorithm is O((m+n)log(m+n)) so it scales much better.
We first sort the union of X and Y by coordinate and then form the cumulative sum over the y values in sorted order. Next, for each x in X we find the number YbefX of y in Y to its left and use it to look up the corresponding cumsum entry YbefXval. The summed distance to all y to the left of x is YbefX times the coordinate of x minus YbefXval; the summed distance to all y to the right is (the sum of all y coordinates minus YbefXval) minus (n - YbefX) times the coordinate of x.
Where does the saving come from? Sorting coordinates enables us to recycle the summations we have done before, instead of starting each time from scratch. This uses the fact that up to a sign we always sum the same y coordinates and going from left to right the signs flip one by one.
Code:
import numpy as np
from scipy.spatial.distance import cdist
from timeit import timeit
def pp(X,Y):
    (m,k),(n,k) = X.shape,Y.shape
    XY = np.concatenate([X.T,Y.T],1)
    idx = XY.argsort(1)
    Xmsk = idx<m
    Ymsk = ~Xmsk
    Xidx = np.arange(k)[:,None],idx[Xmsk].reshape(k,m)
    Yidx = np.arange(k)[:,None],idx[Ymsk].reshape(k,n)
    YbefX = Ymsk.cumsum(1)[Xmsk].reshape(k,m)
    YbefXval = XY[Yidx].cumsum(1)[np.arange(k)[:,None],YbefX-1]
    YbefXval[YbefX==0] = 0
    XY[Xidx] = ((2*YbefX-n)*XY[Xidx]) - 2*YbefXval + Y.sum(0)[:,None]
    return XY[:,:m].sum(0)
def summed_cdist(X,Y):
    return cdist(X,Y,"minkowski",p=1).sum(1)
# demo
m,n,k = 1000,500,10
X,Y = np.random.randn(m,k),np.random.randn(n,k)
print("same result:",np.allclose(pp(X,Y),summed_cdist(X,Y)))
print("sort :",timeit(lambda:pp(X,Y),number=1000),"ms")
print("scipy cdist:",timeit(lambda:summed_cdist(X,Y),number=100)*10,"ms")
Sample run, comparing smart algo "sort" to naive algo implemented using cdist library function:
same result: True
sort : 1.4447695480193943 ms
scipy cdist: 36.41934019047767 ms
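For readers who want to see the one-dimensional idea in isolation, here is a simplified sketch. It is a variant rather than the code above: it sorts only Y and locates each x with np.searchsorted instead of sorting the union, and the function name summed_abs_diff_1d is made up for illustration.

```python
import numpy as np

def summed_abs_diff_1d(x, y):
    """Return, for each value in x, the sum of |x - y_j| over all y_j."""
    y_sorted = np.sort(y)
    # prefix[i] is the sum of the i smallest y values (prefix[0] == 0)
    prefix = np.concatenate(([0.0], np.cumsum(y_sorted)))
    # number of y values strictly to the left of each x
    n_left = np.searchsorted(y_sorted, x)
    left = n_left * x - prefix[n_left]
    right = (prefix[-1] - prefix[n_left]) - (len(y) - n_left) * x
    return left + right

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = rng.normal(size=500)

fast = summed_abs_diff_1d(x, y)
naive = np.abs(x[:, np.newaxis] - y).sum(axis=1)  # O(m*n) reference
print(np.allclose(fast, naive))
```

For k-dimensional points, the same function could be applied to each coordinate column in turn and the results added, since the Manhattan distance is a sum over coordinates.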
I want to solve the linear system given by n points in n-dimensional space to get the equation of the hyperplane through them.
For example, in the two-dimensional case, Ax + By + C = 0.
How can I get one solution when a system of linear equations has infinitely many solutions?
I have tried scipy.linalg.solve() but it requires coefficient matrix A to be nonsingular.
I also tried sympy
from sympy import Matrix, linsolve, symbols
x, y, z = symbols('x y z')
A = Matrix([[0, 0, 1], [1, 1, 1]])
b = Matrix([0, 0])
linsolve((A, b), [x, y, z])
It returned this:
{(−y,y,0)}
I have to parse the result to determine which variable is free and then assign a number to it to get a solution.
Is there a more convenient way, since I only want one specific solution?
I have a two-dimensional array that I want to fill with values that represent powers, but my problem is the speed of the code. The array is 100x100, and I don't want to first initialize it as a 100x100 list of zeros and then fill in the values; I would rather build the 100x100 two-dimensional list from the values directly. My code is shown below:
x_list = np.linspace(min_x, max_x, (max_x - min_x)+1)
y_list = np.linspace(min_y, max_y, (max_y - min_y)+1)
X, Y = np.meshgrid(x_list, y_list)
Y = Y[::-1]
Z = [[0 for x in range(len(x_list))] for x in range(len(y_list))] #Z is the two-dimensional list containing the power at each position in the structure to be plotted
for each_axes in range(len(Z)):
    for each_point in range(len(Z[each_axes])):
        Z[len(Z)-1-each_axes][each_point] = power_at_each_point(each_point, each_axes)
#The method power_at_each_point is the one that calculates the values in the two-dimensional array Z
An example what I want to do is instead of doing what is shown below:
Z_old = [[0,0,0,0,0],[0,0,0,0,0],[0,0,0,0,0]]
for each_axes in range(len(Z_old)):
    for each_point in range(len(Z_old[each_axes])):
        Z_old[len(Z_old)-1-each_axes][each_point] = power_at_each_point(each_point, each_axes)
Now I want to avoid initializing the Z_old array with zeroes and instead fill it with values while iterating, along the lines of what is written below. Its syntax is horribly wrong, but that's what I want to reach in the end.
Z = np.zeros((len(x_list), len(y_list))) for Z[len(x_list) -1 - counter_1][counter_2] is equal to power_at_each_point(counter_1, counter_2] for counter_1 in range(len(x_list)) and counter_2 in range(len(y_list))]
The power_at_each_point method is shown below with its related methods, in case it helps you understand what I want to do:
#A method to calculate the power reached from one node to the other for contourf function
def cal_pow_rec_plandwall_contour(node_index_tx, receiver):
    nodess_excel = xlrd.open_workbook(Node_file_location)
    nodes_sheet = nodess_excel.sheet_by_index(0)
    node_index_tx_coor = [nodes_sheet.cell_value(node_index_tx - 1, 3), nodes_sheet.cell_value(node_index_tx - 1, 4)] #just co-ordinates of a point
    distance = cal_distance(node_index_tx_coor, receiver)
    if distance == 0:
        power_rec = 10 * (np.log10((nodes_sheet.cell_value(node_index_tx - 1, 0) * 1e-3)))
        return power_rec #this is the power received at each position
    else:
        power_rec = 10 * (np.log10((nodes_sheet.cell_value(node_index_tx - 1, 0) * 1e-3))) - 20 * np.log10((4 * math.pi * distance * 2.4e9) / 3e8) - cal_wall_att([node_index_tx_coor, receiver])
        return power_rec
def power_at_each_point(x_cord, y_coord): #A method to get each position in the structure and calculate the power reached at that position to draw the structure's contourf plot
    fa = lambda xa: cal_pow_rec_plandwall_contour(xa, [x_cord, y_coord])
    return max(fa(each_node) for each_node in range(1, len(Node_Positions_Ascending) + 1)) #Node_position_ascending is a list containing the co-ordinate positions of markers basically or nodes.
If someone could tell me how I can fill the two-dimensional array Z with values from the bottom to the top, as I did above, without initially setting it to zero, it would be much appreciated.
OK, first, you want to create a NumPy array, not a list of lists. This is almost always going to be significantly smaller, and a little faster to work on. And, more importantly, it opens the door to vectorizing your loops, which makes them a lot faster to work on. So, instead of this:
Z_old = [[0,0,0,0,0],[0,0,0,0,0],[0,0,0,0,0]]
… do this:
Z_old = np.zeros((3, 5))
But now let's see whether we can vectorize your loop instead of modifying the values:
for each_axes in range(len(Z_old)):
    for each_point in range(len(Z_old[each_axes])):
        Z_old[len(Z_old)-1-each_axes][each_point] = each_point**2 + each_axes**2
The initial values of Z[…] aren't being used at all here, so we don't need to pre-fill them with 0, just as you suspected. What is being used at each point is r and c. (I'm going to rename your Z_old, each_axes, and each_point to Z, r, and c for brevity.) In particular, you're trying to set each Z[len(Z)-1-r, c] to r**2 + c**2.
First, let's reverse the negatives so you're setting each Z[r, c] to something—in this case, to (len(Z)-1-r)**2 + c**2.
That "something" is just a function on r and c values. Which we can get by creating aranges. In particular, arange(5) is just an array of the numbers 0, 1, 2, 3, 4, and arange(5)**2 is an array of the squares 0, 1, 4, 9, 16.
The only problem is that to get a 3x5 array out of this, we have to elementwise add two 2D arrays, a 3x1 array and a 1x5 array (or at least one 2D array that broadcasts against a 1D one), but we've got two 1D arrays from arange. Well, we can reshape one of them:
Z_old = (3 - 1 - np.arange(3)).reshape((3, 1))**2 + np.arange(5)**2
You can, of course, simplify this further (you obviously don't need 3 - 1, and you can just add a new axis without reshape), but hopefully this shows directly how it corresponds to your original code.
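As a sanity check, the sketch below compares the loop version with a broadcast version that adds the new axis by indexing instead of reshape. The names rows, cols, Z_loop and Z_vec are chosen here for illustration.

```python
import numpy as np

rows, cols = 3, 5

# Loop version, filling from the bottom row up as in the question
Z_loop = np.zeros((rows, cols))
for each_axes in range(rows):
    for each_point in range(cols):
        Z_loop[rows - 1 - each_axes][each_point] = each_point**2 + each_axes**2

# Vectorized version: a (rows, 1) column broadcast against a (cols,) row
r = np.arange(rows)[:, np.newaxis]
c = np.arange(cols)
Z_vec = (rows - 1 - r)**2 + c**2

print(Z_vec.shape, np.array_equal(Z_loop, Z_vec))
```

This only works because the stand-in formula is plain arithmetic. A function like power_at_each_point, which reads a spreadsheet and takes a max over nodes for one point at a time, would have to be rewritten to operate on whole arrays before it could be vectorized the same way.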
I use numpy.polyfit to fit a 2nd-order polynomial to a set of data:
fit1, fit_err1, _, _, _ = np.polyfit(xint[:index_max],
yint[:index_max],
2,
full=True)
For some few examples of my data, the variable fit_err1 is empty although the fit was successful, i.e. fit1 is not empty!
Does anybody know what an empty residual means in this context? Thank you!
EDIT:
one example data set:
x = [-488., -478., -473.]
y = [ 0.02080881, 0.03233648, 0.03584448]
fit1, fit_err1, _, _, _ = np.polyfit(x, y, 2, full=True)
result:
fit1 = [ -3.00778818e-05 -2.79024663e-02 -6.43272769e+00]
fit_err1 = []
I know that fitting a 2nd-order polynomial to a set of three points is not very useful, but I would still expect the function to either raise a warning, or (as it actually determined a fit) return the actual residuals, or both (like "here are the residuals, but your conditions are poor!").
As pointed out by @Jaime, if you have three points, a second-order polynomial will fit them exactly. Your point that the error should be 0 rather than an empty array makes sense, but this is the current behavior of np.linalg.lstsq, which np.polyfit wraps.
We can test this behavior with the least-squares fit of y = a*x**0 + b*x**1 + c*x**2 to points for which we know the answer should be a=0, b=0, c=1:
np.linalg.lstsq([[1, 1 ,1], [1, 2, 4], [1, 3, 9]], [1, 4, 9])
#(array([ -3.43396424e-15, 3.88578059e-15, 1.00000000e+00]),
# array([], dtype=float64),
# 3,
# array([ 10.64956309, 1.2507034 , 0.15015641]))
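where we can see that the second output is an empty array. This is the intended behavior: the lstsq documentation states that the residuals array is empty when the rank of the matrix is less than its number of columns or when it has no more rows than columns.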
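If you need a number in every case, one workaround is to compute the sum of squared residuals yourself from the fitted coefficients. The sketch below is only an illustration with made-up data; the helper name sum_squared_residuals is not part of NumPy.

```python
import numpy as np

def sum_squared_residuals(coeffs, x, y):
    """Sum of squared differences between the fitted polynomial and the data."""
    return float(np.sum((np.polyval(coeffs, x) - y) ** 2))

# Three points, degree 2: as many points as coefficients, so residuals is empty
x3, y3 = np.array([1.0, 2.0, 3.0]), np.array([1.0, 4.0, 9.0])
coeffs3, residuals3, *_ = np.polyfit(x3, y3, 2, full=True)

# Four points, degree 2: overdetermined, so residuals holds one value
x4, y4 = np.array([1.0, 2.0, 3.0, 4.0]), np.array([1.0, 4.0, 9.0, 17.0])
coeffs4, residuals4, *_ = np.polyfit(x4, y4, 2, full=True)

print(residuals3.size, sum_squared_residuals(coeffs3, x3, y3))
print(residuals4.size, sum_squared_residuals(coeffs4, x4, y4))
```

In the overdetermined case the manual value should agree with what polyfit reports, and in the exactly determined case it is zero up to rounding.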