Inconsistency in the RANSAC implementation from the SciPy Cookbook (Python)

I recently looked at the RANSAC implementation in the SciPy Cookbook (http://wiki.scipy.org/Cookbook/RANSAC), but it doesn't seem to be consistent with the RANSAC algorithm itself.
Looking at the plot there, how can it be that some of the data points that are quite far from the best model (see the points at the bottom) are labelled as "RANSAC data", while other points that are closer to the model are not?
From my point of view this contradicts the main idea of the RANSAC algorithm, where every point inside the pre-defined threshold around the model is considered an inlier.
Why is that not the case in this implementation, and are there any other RANSAC implementations in Python?
Thanks for your help!
Cheers,
Alexey

No, it does not contradict the idea of RANSAC. The plot is a little misleading, though.
The blue crosses are the sample points (best_inlier_idxs = maybeinliers + alsoinliers), of which some (exactly the points in alsoinliers, i.e. the consensus set) support the model (maybemodel) that was fitted to a random sample of the data (maybeinliers). This means every point in alsoinliers is indeed closer to maybemodel than the closest point that does not support it.
However, the maybemodel fit is not shown in the plot. What is shown is bettermodel (the blue line, "RANSAC fit"), which is obtained by fitting the model parameters to all points in best_inlier_idxs (not just the ones in maybeinliers).
Furthermore, best_inlier_idxs contains both alsoinliers and maybeinliers. There may well be points in the randomly chosen sample maybeinliers that do not in fact support the maybemodel fit (i.e. they are not within the threshold). These points are also shown as blue crosses, even though they are farther away than other points outside the supporting set.
I modified the plotting a little to also indicate the best proposed model (maybemodel) and the random sample (maybeinliers) within the "RANSAC data". The important things are the circles around some of the crosses, which highlight that the random sample is contained in the RANSAC data.
Here's the code for the modified plotting:
def ransac(data, model, n, k, t, d, debug=False, return_all=False):
    iterations = 0
    bestfit = None
    besterr = numpy.inf
    best_inlier_idxs = None
    while iterations < k:
        maybe_idxs, test_idxs = random_partition(n, data.shape[0])
        maybeinliers = data[maybe_idxs, :]
        test_points = data[test_idxs]
        maybemodel = model.fit(maybeinliers)
        test_err = model.get_error(test_points, maybemodel)
        also_idxs = test_idxs[test_err < t]  # select indices of rows with accepted points
        alsoinliers = data[also_idxs, :]
        if debug:
            print 'test_err.min()', test_err.min()
            print 'test_err.max()', test_err.max()
            print 'numpy.mean(test_err)', numpy.mean(test_err)
            print 'iteration %d:len(alsoinliers) = %d' % (
                iterations, len(alsoinliers))
        if len(alsoinliers) > d:
            betterdata = numpy.concatenate((maybeinliers, alsoinliers))
            bettermodel = model.fit(betterdata)
            better_errs = model.get_error(betterdata, bettermodel)
            thiserr = numpy.mean(better_errs)
            if thiserr < besterr:
                bestfit = bettermodel
                besterr = thiserr
                best_inlier_idxs = numpy.concatenate((maybe_idxs, also_idxs))
                best_maybe_model = maybemodel
                best_random_set = maybe_idxs
        iterations += 1
    if bestfit is None:
        raise ValueError("did not meet fit acceptance criteria")
    if return_all:
        return bestfit, {'inliers': best_inlier_idxs,
                         'best_random_set': best_random_set,
                         'best_maybe_model': best_maybe_model}
    else:
        return bestfit

def test():
    # generate perfect input data
    n_samples = 500
    n_inputs = 1
    n_outputs = 1
    A_exact = 20*numpy.random.random((n_samples, n_inputs))
    perfect_fit = 60*numpy.random.normal(size=(n_inputs, n_outputs))  # the model
    B_exact = scipy.dot(A_exact, perfect_fit)
    assert B_exact.shape == (n_samples, n_outputs)

    # add a little gaussian noise (linear least squares alone should handle this well)
    A_noisy = A_exact + numpy.random.normal(size=A_exact.shape)
    B_noisy = B_exact + numpy.random.normal(size=B_exact.shape)

    if 1:
        # add some outliers
        n_outliers = 100
        all_idxs = numpy.arange(A_noisy.shape[0])
        numpy.random.shuffle(all_idxs)
        outlier_idxs = all_idxs[:n_outliers]
        non_outlier_idxs = all_idxs[n_outliers:]
        A_noisy[outlier_idxs] = 20*numpy.random.random((n_outliers, n_inputs))
        B_noisy[outlier_idxs] = 50*numpy.random.normal(size=(n_outliers, n_outputs))

    # setup model
    all_data = numpy.hstack((A_noisy, B_noisy))
    input_columns = range(n_inputs)  # the first columns of the array
    output_columns = [n_inputs + i for i in range(n_outputs)]  # the last columns of the array
    debug = True
    model = LinearLeastSquaresModel(input_columns, output_columns, debug=debug)

    linear_fit, resids, rank, s = scipy.linalg.lstsq(all_data[:, input_columns],
                                                     all_data[:, output_columns])

    # run RANSAC algorithm
    ransac_fit, ransac_data = ransac(all_data, model,
                                     50, 1000, 7e3, 300,  # misc. parameters
                                     debug=debug, return_all=True)
    if 1:
        import pylab

        sort_idxs = numpy.argsort(A_exact[:, 0])
        A_col0_sorted = A_exact[sort_idxs]  # maintain as rank-2 array

        if 1:
            pylab.plot(A_noisy[:, 0], B_noisy[:, 0], 'k.', label='data')
            pylab.plot(A_noisy[ransac_data['inliers'], 0], B_noisy[ransac_data['inliers'], 0], 'bx', label='RANSAC data')
            pylab.plot(A_noisy[ransac_data['best_random_set'], 0], B_noisy[ransac_data['best_random_set'], 0], 'ro', mfc='none', label='best random set (maybeinliers)')
        else:
            pylab.plot(A_noisy[non_outlier_idxs, 0], B_noisy[non_outlier_idxs, 0], 'k.', label='noisy data')
            pylab.plot(A_noisy[outlier_idxs, 0], B_noisy[outlier_idxs, 0], 'r.', label='outlier data')
        pylab.plot(A_col0_sorted[:, 0],
                   numpy.dot(A_col0_sorted, ransac_fit)[:, 0],
                   label='RANSAC fit')
        pylab.plot(A_col0_sorted[:, 0],
                   numpy.dot(A_col0_sorted, perfect_fit)[:, 0],
                   label='exact system')
        pylab.plot(A_col0_sorted[:, 0],
                   numpy.dot(A_col0_sorted, linear_fit)[:, 0],
                   label='linear fit')
        pylab.plot(A_col0_sorted[:, 0],
                   numpy.dot(A_col0_sorted, ransac_data['best_maybe_model'])[:, 0],
                   label='best proposed model (maybemodel)')
        pylab.legend()
        pylab.show()

if __name__ == '__main__':
    test()
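As for the second part of the question (other RANSAC implementations in Python): scikit-learn ships one as sklearn.linear_model.RANSACRegressor. Below is a minimal sketch on synthetic data similar to test() above; the noise levels and residual_threshold are made-up values for illustration only.
import numpy as np
from sklearn.linear_model import LinearRegression, RANSACRegressor

rng = np.random.default_rng(0)
X = 20 * rng.random((500, 1))
y = 60 * X[:, 0] + rng.normal(size=500)   # roughly linear data with a little noise
y[:100] = 50 * rng.normal(size=100)       # overwrite some points with gross outliers

# every point whose absolute residual against the trial model is below
# residual_threshold counts as an inlier of that trial
reg = RANSACRegressor(LinearRegression(), residual_threshold=20.0, max_trials=1000)
reg.fit(X, y)
print("estimated slope:", reg.estimator_.coef_[0])
print("number of inliers:", reg.inlier_mask_.sum())
Note that inlier_mask_ marks the consensus set of the best trial model, so it is directly comparable to alsoinliers above rather than to best_inlier_idxs.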

Related

Unable to fit a function onto a given set of data points in Python using the Scipy library

I have been trying to fit a function (given in the code under the name concave_func) to data points in Python, but have had little to no success. I have 7 parameters (C_1, C_2, alpha_one, alpha_two, I_x, nu_t, T_e) to estimate, and only 6 data points. I have tried two methods to fit the curve and estimate the parameters:
1) scipy.optimize.minimize
2) scipy.optimize.curve_fit
However, I'm not obtaining the desired results, i.e. the curve does not fit the data points.
I have attached my code below.
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import minimize

frequency = np.array([22, 45, 150, 408, 1420, 23000])  # x_values
b_temp = [2.55080863e+04, 4.90777800e+03, 2.28984753e+02, 2.10842949e+01, 3.58631166e+00, 5.68716056e-04]  # y_values

# Defining the function that I want to fit
def concave_func(x, C_1, C_2, alpha_one, alpha_two, I_x, nu_t, T_e):
    one = x**(-alpha_one)
    two = (C_2/C_1)*(x**(-alpha_two))
    three = I_x*(x**-2.1)
    expo = np.exp(-1*((nu_t/x)**2.1))
    eqn_one = C_1*(one + two + three)*expo
    eqn_two = T_e*(1 - expo)
    return eqn_one + eqn_two

# Defining the chi-square function
def chisq(params, xobs, yobs):
    ynew = concave_func(xobs, *params)
    #yerr = np.sum((ynew - yobs)**2)
    yerr = np.sum(((yobs - ynew)/ynew)**2)
    print(yerr)
    return yerr

result = minimize(chisq, [1, 2, 2, 2, 1, 1e6, 8000], args=(frequency, b_temp),
                  method='Nelder-Mead', options={'disp': True, 'maxiter': 10000})
x = np.linspace(-300, 24000, 1000)
plt.yscale("log")
plt.xscale("log")
plt.plot(x, concave_func(x, *result.x))
print(result.x)
print(result)
plt.plot(frequency, b_temp, 'r*')
plt.xlabel("log Frequency [MHz]")
plt.ylabel("log Temp [K]")
plt.title('log Temperature vs log Frequency')
plt.grid()
plt.savefig('the_plot_2060.png')
I have attached the plot that I obtained below.
The plot clearly does not fit the data, and something is definitely wrong. I would also like my parameters alpha_one and alpha_two to be constrained to lie between 2 and 3, and I do not want my parameter T_e to exceed 10,000. Any thoughts?
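One way to impose those constraints (a sketch, not from the original post): scipy.optimize.curve_fit accepts a bounds argument, so the limits on alpha_one, alpha_two and T_e can be passed directly; the loose bounds on the remaining parameters below are assumptions chosen only for illustration.
import numpy as np
from scipy.optimize import curve_fit

# bounds are (lower, upper) arrays in the parameter order of concave_func:
# (C_1, C_2, alpha_one, alpha_two, I_x, nu_t, T_e)
lower = [0, 0, 2, 2, 0, 0, 0]
upper = [np.inf, np.inf, 3, 3, np.inf, np.inf, 1e4]
p0 = [1, 2, 2.5, 2.5, 1, 1e6, 8000]   # starting guess, kept inside the bounds

popt, pcov = curve_fit(concave_func, frequency, b_temp, p0=p0, bounds=(lower, upper))
print(popt)
With only 6 data points and 7 free parameters the problem is still underdetermined, so the bounded fit can at best pin the parameters to the allowed ranges, not determine them uniquely.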

Getting a smooth Poisson Distribution over a Histogram with a Small Number of Bins

I have a Poisson distribution of background counts which mostly contains counts equal to zero. I've fitted a Poisson distribution to this data and got the following result:
I have another dataset from a source with higher count rates; in that case it works fine:
Here's my (inelegant) code in full:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import poisson

mean_values = []

# obtaining results:
for a in data_arrays:
    dataset = globals()[a]
    cps_vals = dataset[:, 1]
    max_cps = int(max(cps_vals))
    mean_name = a + "_mean"
    std_name = a + "_std"
    serr_name = a + "_serr"
    mean = globals()[mean_name] = np.mean(cps_vals)
    globals()[std_name] = np.std(cps_vals, ddof=1)
    globals()[serr_name] = globals()[std_name]/np.sqrt(len(cps_vals))  # I used globals() so I could call in e.g. the background serr as the variable bg_serr.
    print(a, "mean:", globals()[mean_name], "sqrt(mean):", np.sqrt(globals()[mean_name]), "std:", globals()[std_name], "serr:", globals()[serr_name], "sqrt(lambda)/sigma =", np.sqrt(globals()[mean_name])/globals()[std_name])

    # plotting with Poisson:
    plt.figure()
    bin_edges = np.arange(0, max_cps + 1.1, 1)
    histogram = plt.hist(cps_vals, density=True, bins=bin_edges)
    plt.xlabel("Counts Per Second")
    plt.ylabel("Probability of Occurrence")
    pops = histogram[0]
    bins = histogram[1]
    maxidx = np.argmax(pops)
    maxpop = pops[maxidx]
    maxbin = np.max(bins)
    most_populated_bin = bins[maxidx]
    plt.plot(np.arange(0, maxbin), poisson.pmf(np.arange(0, maxbin),
             np.mean(cps_vals)), c="black")
This is the relevant line for the Poisson plot:
plt.plot(np.arange(0, maxbin), poisson.pmf(np.arange(0, maxbin), np.mean(cps_vals)), c="black")
If I try to make the np.arange spacing smaller, I get ringing in the Poisson curves:
I think this is because the pmf needs integer values of counts?
How can I produce a smooth Gaussian curve for the background count? The one I'm getting doesn't look right.
mu = 15
r = poisson.rvs(mu, size=100000)
plt.hist(r, bins=np.linspace(0, 35, 36), alpha=0.5, label='counting process', ec='black', align='left')
plt.plot(poisson.pmf(np.linspace(0, 35, 36),mu)*100000)
plt.legend()
Gives:
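Regarding the ringing: scipy.stats.poisson.pmf is only defined at integer counts, so evaluating it on a finer grid returns zero between the integers, which is what produces the spikes. One workaround, sketched below (not part of the original post), is to evaluate the same formula through the log-gamma function, which extends it smoothly to non-integer values; mu = 0.3 is just an assumed low background rate.
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import gammaln
from scipy.stats import poisson

mu = 0.3                      # assumed mean background count rate
k = np.linspace(0, 5, 500)    # fine, non-integer grid

# continuous extension of the Poisson pmf: exp(k*ln(mu) - mu - ln(Gamma(k+1)))
smooth_pmf = np.exp(k * np.log(mu) - mu - gammaln(k + 1))

ks = np.arange(0, 6)
plt.plot(k, smooth_pmf, 'k-', label='smooth Poisson curve')
plt.plot(ks, poisson.pmf(ks, mu), 'ro', label='pmf at integer counts')
plt.legend()
plt.show()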

Distribution plot is showing a flat PDF

I tried to plot the probability density function (PDF) of my data after finding the best-fitting parameters, but the plot shows a flat line instead of a curve.
Is it a matter of scaling?
Is it a problem of continuous vs. discrete data? The data file is available here.
The purpose is to find the best-fitting distributions and then plot their PDFs.
My data values are very small, e.g. 0.21, 1.117, etc. The data statistics and PDF plots are shown below:
My script is given below:
from time import time
from datetime import datetime
start_time = datetime.now()
import pandas as pd
pd.options.display.float_format = '{:.4f}'.format
import numpy as np
import pickle
import scipy
import scipy.stats
import matplotlib.pyplot as plt

data = pd.read_csv("line_RXC_data.csv", usecols=['R'], parse_dates=True, squeeze=True)
df = data
y_std = df
# del yy
import warnings
warnings.filterwarnings("ignore")

# Create an index array (x) for data
y = df
x = np.arange(len(y))
size = len(y)

# simple visualisation of the data
plt.hist(y)
plt.title("Histogram of resistance")
plt.xlabel("Resistance data visualization")
plt.ylabel("Frequency")
plt.show()

y_df = pd.DataFrame(y)
tt = y_df.describe()
print(tt)

dist_names = [
    'foldcauchy',
    'beta',
    'expon',
    'exponnorm',
    'norm',
    'lognorm',
    'dweibull',
    'pareto',
    'gamma',
]

x = np.arange(len(df))
size = len(df)
y_std = df
y = df
chi_square = []
p_values = []

# Set up 50 bins for the chi-square test
# Observed data will be approximately evenly distributed across all bins
percentile_bins = np.linspace(0, 100, 51)
percentile_cutoffs = np.percentile(y_std, percentile_bins)
observed_frequency, bins = np.histogram(y_std, bins=percentile_cutoffs)
cum_observed_frequency = np.cumsum(observed_frequency)

# Loop through candidate distributions
for distribution in dist_names:
    s1 = time()
    # Set up distribution and get fitted distribution parameters
    dist = getattr(scipy.stats, distribution)
    # print("1")
    param = dist.fit(y_std)
    # print("2")

    # Obtain the KS test p-value, rounded to 5 decimal places
    p = scipy.stats.kstest(y_std, distribution, args=param)[1]
    p = np.around(p, 5)
    p_values.append(p)
    # print("3")

    # Get expected counts in percentile bins
    # This is based on the cumulative distribution function (cdf)
    cdf_fitted = dist.cdf(percentile_cutoffs, *param[:-2], loc=param[-2],
                          scale=param[-1])
    # print("4")
    expected_frequency = []
    for bin in range(len(percentile_bins) - 1):
        expected_cdf_area = cdf_fitted[bin + 1] - cdf_fitted[bin]
        expected_frequency.append(expected_cdf_area)

    # calculate chi-squared
    expected_frequency = np.array(expected_frequency) * size
    cum_expected_frequency = np.cumsum(expected_frequency)
    ss = sum(((cum_expected_frequency - cum_observed_frequency) ** 2) / cum_observed_frequency)
    chi_square.append(ss)
    print(f"chi_square {distribution} time: {time() - s1}")
    # print("std of predicted probability : ", np.std(cum_observed_frequency))

# Collate results and sort by goodness of fit (best at top)
results = pd.DataFrame()
results['Distribution'] = dist_names
results['chi_square'] = chi_square
results['p_value'] = p_values
results.sort_values(['chi_square'], inplace=True)

# Report results
print('\nDistributions sorted by goodness of fit:')
print('----------------------------------------')
print(results)

#%%
# Divide the observed data into 100 bins for plotting (this can be changed)
number_of_bins = 100
bin_cutoffs = np.linspace(np.percentile(y, 0), np.percentile(y, 99), number_of_bins)

# Create the plot
plt.figure(figsize=(7, 4))
h = plt.hist(y, bins=bin_cutoffs, color='0.70')

# Get the top distributions from the previous phase
number_distributions_to_plot = 5
dist_names = results['Distribution'].iloc[0:number_distributions_to_plot]

#%%
# Create an empty list to store fitted distribution parameters
parameters = []

# Loop through the distributions to get line fits and parameters
for dist_name in dist_names:
    # Set up distribution and store distribution parameters
    dist = getattr(scipy.stats, dist_name)
    param = dist.fit(y)
    parameters.append(param)

    # Get line for each distribution (and scale to match observed data)
    pdf_fitted = dist.pdf(x, *param[:-2], loc=param[-2], scale=param[-1])
    scale_pdf = np.trapz(h[0], h[1][:-1]) / np.trapz(pdf_fitted, x)
    pdf_fitted *= scale_pdf

    # Add the line to the plot
    plt.plot(pdf_fitted, label=dist_name)

    # Set the plot x axis to contain 99% of the data
    # This can be removed, but sometimes outlier data makes the plot less clear
    plt.xlim(0, np.percentile(y, 99))

# Add legend and display plot
# fig = plt.figure(figsize=(8,5))
plt.legend()
plt.title(u'Data distribution characteristics\n')
plt.xlabel(u'Resistance')
plt.ylabel('Frequency')
plt.show()

# Store distribution parameters in a dataframe (this could also be saved)
dist_parameters = pd.DataFrame()
dist_parameters['Distribution'] = (
    results['Distribution'].iloc[0:number_distributions_to_plot])
dist_parameters['Distribution parameters'] = parameters

# Print parameter results
print('\nDistribution parameters:')
print('------------------------')
for index, row in dist_parameters.iterrows():
    print('\nDistribution:', row[0])
    print('Parameters:', row[1])
If you look at the following categorical frequency analysis, you'll see that you have only 15 distinct values spread across the range with large gaps in between—not a continuum of values. Half the observations have the value 0.211, with another ~36% occurring at the value 1.117, ~8% at 0.194, and ~4% at 0.001. I think it's a mistake to treat this as continuous data.
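A quick way to check this yourself (a sketch, assuming the same line_RXC_data.csv file and 'R' column as in the question):
import pandas as pd

r = pd.read_csv("line_RXC_data.csv", usecols=['R'])['R']
print(r.nunique(), "distinct values")                 # only a handful of distinct levels
print(r.value_counts(normalize=True).sort_index())    # fraction of observations at each value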

How to isolate data that are 2 and 3 sigma deviated from the mean and then mark them in a plot in Python?

I am reading from a dataset which looks like the following when plotted in matplotlib and then fitted with a best-fit curve using linear regression.
A sample of the data looks like the following:
# ID X Y px py pz M R
1.04826492772e-05 1.04828050287e-05 1.048233088e-05 0.000107002791008 0.000106552433081 0.000108704469007 387.02 4.81947797625e+13
1.87380963036e-05 1.87370588085e-05 1.87372620448e-05 0.000121616280029 0.000151924707761 0.00012371156585 428.77 6.54636174067e+13
3.95579877816e-05 3.95603773653e-05 3.95610756809e-05 0.000163470663023 0.000265203868883 0.000228031803626 470.74 8.66961875758e+13
My code looks like the following:
# Regression Function
def regress(x, y):
    # Return a tuple of predicted y values and parameters for linear regression.
    p = sp.stats.linregress(x, y)
    b1, b0, r, p_val, stderr = p
    y_pred = sp.polyval([b1, b0], x)
    return y_pred, p

# plotting z
xz, yz = M, Y_z  # data, non-transformed
y_pred, _ = regress(xz, np.log(yz))  # change here # transformed input
plt.semilogy(xz, yz, marker='o', color='b', markersize=4, linestyle='None', label="l.o.s within R500")
plt.semilogy(xz, np.exp(y_pred), "b", label='best fit')  # transformed output
However, I can see a lot of upward scatter in the data, and the best-fit curve is affected by it. So first I want to isolate the data points that are 2 and 3 sigma away from my mean data and mark them with circles around them.
Then I want to take the best-fit curve considering only the points that fall within 1 sigma of my mean data.
Is there a good function in Python which can do that for me?
In addition, may I also isolate those rows from my actual dataset? For example, if the third row in the sample input represents a 2-sigma deviation, may I have that row as output too, to save later and investigate further?
Your help is most appreciated.
Here's some code that goes through the data in a given number of windows, calculates statistics in each window, and separates the data into well-behaved and misbehaved lists.
Hope this helps.
from scipy import stats
from scipy import polyval
import numpy as np
import matplotlib.pyplot as plt

num_data = 10000
fake_data_x = np.sort(12.8 + np.random.random(num_data))
fake_data_y = np.exp(fake_data_x) + np.random.normal(0, scale=50000, size=num_data)

# Regression Function
def regress(x, y):
    # Return a tuple of predicted y values and parameters for linear regression.
    p = stats.linregress(x, y)
    b1, b0, r, p_val, stderr = p
    y_pred = polyval([b1, b0], x)
    return y_pred, p

# plotting z
xz, yz = fake_data_x, fake_data_y  # data, non-transformed
y_pred, _ = regress(xz, np.log(yz))  # change here # transformed input
plt.figure()
plt.semilogy(xz, yz, marker='o', color='b', markersize=4, linestyle='None', label="l.o.s within R500")
plt.semilogy(xz, np.exp(y_pred), "b", label='best fit')  # transformed output
plt.show()

num_bin_intervals = 10  # approx number of averaging windows
window_boundaries = np.linspace(min(fake_data_x), max(fake_data_x), int(len(fake_data_x)/num_bin_intervals))  # window boundaries
y_good = []  # list to collect the "well-behaved" y-axis data
x_good = []  # list to collect the "well-behaved" x-axis data
y_outlier = []
x_outlier = []

for i in range(len(window_boundaries) - 1):
    # create a boolean mask to select the data within the averaging window
    window_indices = (fake_data_x <= window_boundaries[i+1]) & (fake_data_x > window_boundaries[i])
    # separate the pieces of data in the window
    fake_data_x_slice = fake_data_x[window_indices]
    fake_data_y_slice = fake_data_y[window_indices]
    # calculate the mean y-value in the window
    y_mean = np.mean(fake_data_y_slice)
    y_std = np.std(fake_data_y_slice)
    # choose and select the outliers
    y_outliers = fake_data_y_slice[np.abs(fake_data_y_slice - y_mean) >= 2*y_std]
    x_outliers = fake_data_x_slice[np.abs(fake_data_y_slice - y_mean) >= 2*y_std]
    # choose and select the good ones
    y_goodies = fake_data_y_slice[np.abs(fake_data_y_slice - y_mean) < 2*y_std]
    x_goodies = fake_data_x_slice[np.abs(fake_data_y_slice - y_mean) < 2*y_std]
    # extend the lists with all the good and the bad
    y_good.extend(list(y_goodies))
    y_outlier.extend(list(y_outliers))
    x_good.extend(list(x_goodies))
    x_outlier.extend(list(x_outliers))

plt.figure()
plt.semilogy(x_good, y_good, 'o')
plt.semilogy(x_outlier, y_outlier, 'r*')
plt.show()
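To also get the matching rows of the original dataset (the second part of the question), one option is to work with row indices instead of copying x/y values out. This is a sketch, not part of the answer above; the column index and filename are placeholders:
import numpy as np

# hypothetical helper: row indices whose y-column deviates from the mean
# by at least nsigma standard deviations (computed over the whole dataset)
def flag_outlier_rows(data, ycol, nsigma=2):
    y = data[:, ycol]
    return np.where(np.abs(y - y.mean()) >= nsigma * y.std())[0]

# usage sketch:
# dataset = np.loadtxt("my_data.txt")                  # placeholder filename
# rows_2sigma = flag_outlier_rows(dataset, ycol=2, nsigma=2)
# np.savetxt("rows_2sigma.txt", dataset[rows_2sigma])  # save those rows for later inspection
The same idea works per window: keep np.where(window_indices)[0] and index into it with the outlier mask to recover the original row numbers.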

Why do different methods for solving Xc=y in Python give different solutions when they should not?

I was trying to solve a square linear system Xc = y. The methods I know for solving this are:
using the inverse: c = X^(-1) y
using Gaussian elimination
using the pseudo-inverse
As far as I can tell, these don't match what I thought would be the ground truth.
First I generate the truth parameters by fitting a polynomial of degree 30 to a cosine with frequency 5, so I have y_truth = X*c_truth.
Then I check whether the above three methods match the truth.
I tried it, but the methods don't seem to match, and I don't see why that should be the case.
Here is fully runnable, reproducible code:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
## some parameters
degree_target = 25
N_train = degree_target+1
lb,ub = -2000,2000
x = np.linspace(lb,ub,N_train)
## generate target polynomial model
freq_cos = 5
y_cos = np.cos(2*np.pi*freq_cos*x)
c_polyfit = np.polyfit(x,y_cos,degree_target)[::-1] ## needs to be reversed so the highest power comes last
## generate kernel matrix
poly_feat = PolynomialFeatures(degree=degree_target)
K = poly_feat.fit_transform(x.reshape(N_train,1)) # generates degree 0 first
## get target samples of the function
y = np.dot(K,c_polyfit)
## get pinv approximation of c_polyfit
c_pinv = np.dot( np.linalg.pinv(K), y)
## get Gaussian-elimination approximation of c_polyfit
c_GE = np.linalg.solve(K,y)
## get inverse matrix approximation of c_polyfit
i = np.linalg.inv(K)
c_mdl_i = np.dot(i,y)
## check rank to see if its truly invertible
print('rank(K) = {}'.format( np.linalg.matrix_rank(K) ))
## compare parameters
print('--c_polyfit')
print('||c_polyfit-c_GE||^2 = {}'.format( np.linalg.norm(c_polyfit-c_GE) ))
print('||c_polyfit-c_pinv||^2 = {}'.format( np.linalg.norm(c_polyfit-c_pinv) ))
print('||c_polyfit-c_mdl_i||^2 = {}'.format( np.linalg.norm(c_polyfit-c_mdl_i) ))
print('||c_polyfit-c_polyfit||^2 = {}'.format( np.linalg.norm(c_polyfit-c_polyfit) ))
##
print('--c_GE')
print('||c_GE-c_GE||^2 = {}'.format( np.linalg.norm(c_GE-c_GE) ))
print('||c_GE-c_pinv||^2 = {}'.format( np.linalg.norm(c_GE-c_pinv) ))
print('||c_GE-c_mdl_i||^2 = {}'.format( np.linalg.norm(c_GE-c_mdl_i) ))
print('||c_GE-c_polyfit||^2 = {}'.format( np.linalg.norm(c_GE-c_polyfit) ))
##
print('--c_pinv')
print('||c_pinv-c_GE||^2 = {}'.format( np.linalg.norm(c_pinv-c_GE) ))
print('||c_pinv-c_pinv||^2 = {}'.format( np.linalg.norm(c_pinv-c_pinv) ))
print('||c_pinv-c_mdl_i||^2 = {}'.format( np.linalg.norm(c_pinv-c_mdl_i) ))
print('||c_pinv-c_polyfit||^2 = {}'.format( np.linalg.norm(c_pinv-c_polyfit) ))
##
print('--c_mdl_i')
print('||c_mdl_i-c_GE||^2 = {}'.format( np.linalg.norm(c_mdl_i-c_GE) ))
print('||c_mdl_i-c_pinv||^2 = {}'.format( np.linalg.norm(c_mdl_i-c_pinv) ))
print('||c_mdl_i-c_mdl_i||^2 = {}'.format( np.linalg.norm(c_mdl_i-c_mdl_i) ))
print('||c_mdl_i-c_polyfit||^2 = {}'.format( np.linalg.norm(c_mdl_i-c_polyfit) ))
and I get the result:
rank(K) = 4
--c_polyfit
||c_polyfit-c_GE||^2 = 4.44089220304006e-16
||c_polyfit-c_pinv||^2 = 1.000000000000001
||c_polyfit-c_mdl_i||^2 = 1.1316233165135605e-06
||c_polyfit-c_polyfit||^2 = 0.0
--c_GE
||c_GE-c_GE||^2 = 0.0
||c_GE-c_pinv||^2 = 1.0000000000000007
||c_GE-c_mdl_i||^2 = 1.1316233160694804e-06
||c_GE-c_polyfit||^2 = 4.44089220304006e-16
--c_pinv
||c_pinv-c_GE||^2 = 1.0000000000000007
||c_pinv-c_pinv||^2 = 0.0
||c_pinv-c_mdl_i||^2 = 0.9999988683985006
||c_pinv-c_polyfit||^2 = 1.000000000000001
--c_mdl_i
||c_mdl_i-c_GE||^2 = 1.1316233160694804e-06
||c_mdl_i-c_pinv||^2 = 0.9999988683985006
||c_mdl_i-c_mdl_i||^2 = 0.0
||c_mdl_i-c_polyfit||^2 = 1.1316233165135605e-06
Why is that? Is it a machine-precision issue? Or is it because the error accumulates (a lot) when the degree is large (greater than 1)? Honestly I don't know, but all those hypotheses seem silly to me. If anyone can spot my mistake, feel free to point it out. Otherwise, I might not understand linear algebra or something... which is a lot more worrying.
Also, any suggestions for making this work would be highly appreciated. Should I:
increase the size of the interval so it is not less than 1 (in magnitude)?
what is the largest polynomial degree I may use?
use a different language? Or increase the precision?
Any suggestions are appreciated!
The issue is floating point accuracy. You're raising numbers between zero and one to the 30th power, then adding them together. If you were doing this with infinite precision arithmetic, the methods would recover the inputs. With floating point arithmetic, precision loss means you can't.
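A quick way to see this (a sketch, not from the original answer, reusing the question's setup): the polynomial feature matrix K is astronomically ill-conditioned, so tiny rounding errors in y are hugely amplified in c no matter which of the three methods is used.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

degree_target = 25
N_train = degree_target + 1
x = np.linspace(-2000, 2000, N_train)
K = PolynomialFeatures(degree=degree_target).fit_transform(x.reshape(-1, 1))

# the condition number bounds how much relative error in y can be amplified in c
print("cond(K) =", np.linalg.cond(K))
With column magnitudes ranging from 1 up to 2000**25, the condition number dwarfs the roughly 1e16 relative precision of 64-bit floats; the rank(K) = 4 printed by the question's script is another symptom of the same numerical rank deficiency.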
