I'm currently doing my numerical analysis homework. I use Python to analyze the influence of different values of a parameter (w in the code) on the backward error of an algorithm, and I want to use matplotlib.pyplot to plot a scatter chart of the results. But the scatter doesn't look like what I want.
As you can see from the figure, the values on the y-axis are not ascending from bottom to top; they are distributed randomly, and all the points seem to lie on the same line. I've tried a lot of methods to fix it but failed.
Here's the offending code; the data file "SOR2" is listed below it.
import matplotlib.pyplot as plt
import numpy as np
# read SOR2
SOR2 = open("SOR2", 'r')
w = []
e = []
for line in SOR2:
    data = line.strip().split()
    w.append(data[0])
    e.append(data[1])
SOR2.close()
# plot scatter
plt.xlabel("w")
plt.ylabel("backward error")
plt.scatter(w, e)
plt.show()
The data in the file "SOR2" (the left column is w, the right column is the backward error):
0.50 1.05549
0.51 1.01085
0.52 0.96795
0.53 0.92669
0.54 0.88701
0.55 0.84883
0.56 0.81210
0.57 0.77676
0.58 0.74274
0.59 0.70999
0.60 0.67847
0.61 0.64811
0.62 0.61889
0.63 0.59075
0.64 0.56366
0.65 0.53758
0.66 0.51247
0.67 0.48829
0.68 0.46502
0.69 0.44263
0.70 0.42107
0.71 0.40034
0.72 0.38039
0.73 0.36120
0.74 0.34276
0.75 0.32503
0.76 0.30799
0.77 0.29163
0.78 0.27592
0.79 0.26084
0.80 0.24638
0.81 0.23251
0.82 0.21921
0.83 0.20648
0.84 0.19429
0.85 0.18263
0.86 0.17148
0.87 0.16083
0.88 0.15067
0.89 0.14097
0.90 0.13173
0.91 0.12293
0.92 0.11457
0.93 0.10662
0.94 0.09908
0.95 0.09193
0.96 0.08516
0.97 0.07876
0.98 0.07272
0.99 0.06702
1.00 0.06166
1.01 0.05663
1.02 0.05190
1.03 0.04748
1.04 0.04335
1.05 0.03950
1.06 0.03599
1.07 0.03276
1.08 0.02977
1.09 0.02699
1.10 0.02442
1.11 0.02208
1.12 0.01993
1.13 0.01794
1.14 0.01609
1.15 0.01438
1.16 0.01280
1.17 0.01139
1.18 0.01009
1.19 0.00890
1.20 0.00791
1.21 0.00706
1.22 0.00630
1.23 0.00560
1.24 0.00498
1.25 0.00441
1.26 0.00402
1.27 0.00384
1.28 0.00434
1.29 0.00514
1.30 0.00610
1.31 0.00723
1.32 0.00856
1.33 0.01013
1.34 0.01196
1.35 0.01408
1.36 0.01655
1.37 0.01940
1.38 0.02268
1.39 0.02645
1.40 0.03077
1.41 0.03571
1.42 0.04133
1.43 0.04773
1.44 0.05498
1.45 0.06319
1.46 0.07246
1.47 0.08291
1.48 0.09466
1.49 0.10786
(The resulting figure, showing the mis-ordered y-axis labels and the points lined up along a single line, is omitted here.)
As @krm commented, the data needs to be converted to float; otherwise matplotlib treats the strings as categorical labels and plots them in the order they appear in the file:
w.append(float(data[0]))
e.append(float(data[1]))
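With that change, the whole read loop from the question becomes:

w = []
e = []
with open("SOR2") as SOR2:
    for line in SOR2:
        data = line.strip().split()
        w.append(float(data[0]))   # parse as numbers, not strings
        e.append(float(data[1]))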
Alternatively, you can use pandas to reduce all the parsing and plotting to a couple of lines with pandas.read_fwf() and DataFrame.plot.scatter():
import pandas as pd
df = pd.read_fwf('SOR2', header=None, names=['w', 'e'])
df.plot.scatter(x='w', y='e', ylabel='backward error')
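If you'd rather treat the columns as whitespace-separated than fixed-width, a read_csv sketch works the same way:

import pandas as pd

df = pd.read_csv('SOR2', sep=r'\s+', header=None, names=['w', 'e'])
df.plot.scatter(x='w', y='e', ylabel='backward error')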
I am dealing with multivariate regression problems.
My dataset is something like X = (nsample, nx) and Y = (nsample, ny).
nx and ny may vary from dataset to dataset and from case to case, so the code should handle them generically.
I would like to determine the coefficients of the multivariate polynomial regression that minimize the root mean square error.
I thought to split the problem into ny separate regressions, so for each of them the dataset is X = (nsample, nx) and Y = (nsample, 1). For each dependent variable (Uj), the second-order polynomial then has the form
Uj = beta_0 + sum_i beta_i*x_i + sum_i sum_{k>=i} beta_ik*x_i*x_k
I coded the function in Python as:
def func(x, nx, pars0, pars1, pars2):
    y = pars0                          # pars0 = bias
    for i in range(nx):
        y = y + pars1[i]*x[i]          # pars1: linear coefficients (beta_i)
        for j in range(i, nx):
            # diagonal of pars2: coefficients of x_i**2 (beta_ii)
            # upper triangle of pars2: cross coefficients x_i*x_j (beta_ik)
            y = y + pars2[i,j]*x[i]*x[j]
    return y
and the root mean square error as:
def resid(nsample, nx, pars0, pars1, pars2, x, y):
    res = 0.0
    for i in range(nsample):
        y_pred = func(x[i], nx, pars0, pars1, pars2)
        res = res + (y_pred - y[i])**2
    res = res/nsample
    res = res**0.5
    return res
To determine the coefficients I thought to use scipy.optimize.minimize, but it does not work (example_1, example_2).
Any ideas or advice? Should I use sklearn?
EDIT: toy test data, nx = 3, ny = 1:
0.20 -0.02 0.20 1.0229781
0.20 -0.02 0.40 1.0218807
0.20 -0.02 0.60 1.0220439
0.20 -0.02 0.80 1.0227083
0.20 -0.02 1.00 1.0237960
0.20 -0.02 1.20 1.0255770
0.20 -0.02 1.40 1.0284888
0.20 -0.06 0.20 1.0123552
0.24 -0.02 1.40 1.0295350
0.24 -0.06 0.20 1.0125935
0.24 -0.06 0.40 1.0195798
0.24 -0.06 0.60 1.0124632
0.24 -0.06 0.80 1.0131748
0.24 -0.06 1.00 1.0141751
0.24 -0.06 1.20 1.0153533
0.24 -0.06 1.40 1.0170036
0.24 -0.10 0.20 1.0026915
0.24 -0.10 0.40 1.0058125
0.24 -0.10 0.60 1.0055921
0.24 -0.10 0.80 1.0057868
0.24 -0.10 1.00 1.0014004
0.24 -0.10 1.20 1.0026257
0.24 -0.10 1.40 1.0024578
0.30 -0.18 0.60 0.9748765
0.30 -0.18 0.80 0.9753220
0.30 -0.18 1.00 0.9740970
0.30 -0.18 1.20 0.9727272
0.30 -0.18 1.40 0.9732258
0.30 -0.20 0.20 0.9722360
0.30 -0.20 0.40 0.9687567
0.30 -0.20 0.60 0.9676569
0.30 -0.20 0.80 0.9672319
0.30 -0.20 1.00 0.9682354
0.30 -0.20 1.20 0.9674461
0.30 -0.20 1.40 0.9673747
0.36 -0.02 0.20 1.0272033
0.36 -0.02 0.40 1.0265790
0.36 -0.02 0.60 1.0271688
0.36 -0.02 0.80 1.0277286
0.36 -0.02 1.00 1.0285388
0.36 -0.02 1.20 1.0295619
0.36 -0.02 1.40 1.0310734
0.36 -0.06 0.20 1.0159603
0.36 -0.06 0.40 1.0159753
0.36 -0.06 0.60 1.0161890
0.36 -0.06 0.80 1.0153346
0.36 -0.06 1.00 1.0159790
0.36 -0.06 1.20 1.0167520
0.36 -0.06 1.40 1.0176916
0.36 -0.10 0.20 1.0048287
0.36 -0.10 0.40 1.0034699
0.36 -0.10 0.60 1.0032798
0.36 -0.10 0.80 1.0037224
0.36 -0.10 1.00 1.0059301
0.36 -0.10 1.20 1.0047114
0.36 -0.10 1.40 1.0041287
0.36 -0.14 0.20 0.9926268
0.40 -0.08 0.80 1.0089013
0.40 -0.08 1.20 1.0096265
0.40 -0.08 1.40 1.0103305
0.40 -0.10 0.20 1.0045464
0.40 -0.10 0.40 1.0041031
0.40 -0.10 0.60 1.0035650
0.40 -0.10 0.80 1.0034553
0.40 -0.10 1.00 1.0034699
0.40 -0.10 1.20 1.0030276
0.40 -0.10 1.40 1.0035284
0.40 -0.10 1.60 1.0042166
0.40 -0.14 0.20 0.9924336
0.40 -0.14 0.40 0.9914971
0.40 -0.14 0.60 0.9910082
0.40 -0.14 0.80 0.9903772
0.40 -0.14 1.00 0.9900816
Minimizing error is a huge, complex problem. As such, a lot of very clever people have thought up a lot of cool solutions. Here are a few:
(out of all of them, I think Bayesian optimization with sklearn might be a good choice for your use case, though I've never used it)
Random approaches:
genetic algorithms: formats your problem like chromosomes in a genome and "breeds" an optimal solution (a personal favorite of mine)
simulated annealing: formats your problem like hot metal being annealed, which attempts to settle into a stable state while losing heat
random search: better than it sounds. Randomly tests a variety of input values.
grid search: simple to implement, but often less effective than methods that employ true randomness (it duplicates exploration along particular axes of interest and often wastes computational resources)
A lot of these come up in hyperparameter optimization for ML models.
More Prescriptive Approaches:
Gradient Descent: uses the gradient calculated in a differentiable function to step toward local minima
DeepAR: uses Bayesian optimization, combined with random search, to reduce loss in hyperparameter tuning. While I believe this is only available on AWS, it looks like sklearn has an implementation of Bayesian optimization
scipy.optimize.minimize: I know you're already using this, but there are 15 different algorithms that can be selected by changing the method flag; see the sketch after this list.
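Since the method flag is the easiest of these to try, here is a minimal, hedged sketch of wiring an RMSE objective for the second-order polynomial into scipy.optimize.minimize. The flat beta layout (bias, then linear terms, then upper-triangular quadratic terms) and the toy data are my own assumptions, not your exact setup:

import numpy as np
from scipy.optimize import minimize

def rmse(beta, X, Y):
    # assumed layout: beta = [bias, beta_1..beta_nx, beta_ik for k >= i]
    nx = X.shape[1]
    bias, lin, quad = beta[0], beta[1:1 + nx], beta[1 + nx:]
    iu = np.triu_indices(nx)               # index pairs (i, k) with k >= i
    cross = X[:, iu[0]] * X[:, iu[1]]      # x_i * x_k feature columns
    Y_hat = bias + X @ lin + cross @ quad
    return np.sqrt(np.mean((Y_hat - Y) ** 2))

rng = np.random.default_rng(0)
X = rng.random((50, 3))                    # toy stand-in for the real data
Y = 1.0 + X @ np.array([0.5, -0.2, 0.1]) + 0.3 * X[:, 0] * X[:, 2]
n_params = 1 + 3 + 3 * (3 + 1) // 2        # bias + linear + quadratic terms
result = minimize(rmse, x0=np.zeros(n_params), args=(X, Y), method="Nelder-Mead")
print(result.fun)                          # achieved RMSE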
The rub
While error minimization is conceptually simple, in practice complex error topologies in high-dimensional spaces can be very difficult to traverse efficiently. It touches on local and global extrema, the explore/exploit problem, and our mathematical understanding of what computational complexity even is. Often, a good error reduction is accomplished through a combination of a thorough understanding of the problem and experimentation with multiple algorithms and hyperparameters. In ML, this is often referred to as hyperparameter tuning, and is a sort of "meta" error-reduction step, if you will.
note: feel free to recommend more optimization methods, I'll add them to the list.
I have an example using Simulated Annealing, as mentioned in the nice list in this thread.
First, I need to load the data and define the objective function. I saved your data in data.csv and loaded it with
import pandas as pd
data = pd.read_csv("../data.csv", sep=" ", header=None, engine='python')
And fetch your values with
X = data[ [0,1,2] ].values
Y = data[ 3 ].values
I define your poly function with
from itertools import combinations
import numpy as np

def poly_function(X, beta):
    X_dimension = X.shape[1]
    i, j = zip(*combinations(range(X_dimension), 2))
    X_cross = X[:, i] * X[:, j]
    X_expanded = np.concatenate([X, X**2, X_cross], axis=1)
    assert X_expanded.shape[1] == beta.shape[0], "Expect beta to be of size {}".format(X_expanded.shape[1])
    return np.matmul(X_expanded, beta)
For Simulated Annealing we simply need an objective
def obj(beta, X=X, Y=Y):
    Y_hat = poly_function(X, beta)
    BOOSTER = 10**5   # scale the RMSE up so small differences register against the temperature schedule
    return BOOSTER * np.mean((Y - Y_hat)**2)**.5
and some proposals
def small_delta(beta):
    new_beta = beta.copy()
    random_index = np.random.randint(0, new_beta.shape[0])
    new_beta[random_index] += (np.random.random() - .5) * .01
    return new_beta

def large_delta(beta):
    new_beta = beta.copy()
    random_index = np.random.randint(0, new_beta.shape[0])
    new_beta[random_index] += np.random.random() - .5
    return new_beta
And a random start
def random_beta():
    return np.random.random(size=9)
And SA with
import frigidum
local_opt = frigidum.sa(random_start=random_beta,
                        neighbours=[small_delta, large_delta],
                        objective_function=obj,
                        T_start=10**2,
                        T_stop=10**-12,
                        repeats=10**3,
                        copy_state=frigidum.annealing.copy)
The RMSE I found with your data was around 0.026254 with beta
array([ 7.73168440e+00, 2.93929578e+00, 4.10133180e-02, -1.37266444e+01,
-3.43978686e+00, -1.12816177e-02, -1.00262307e+01, -3.12327590e-02,
9.07369588e-02])
where you need to know that it is built up as (X1, X2, X3, X1**2, X2**2, X3**2, X1*X2, X1*X3, X2*X3).
A longer run with more repeats gave me an error of 0.026150 with beta
array([ 7.89212770e+00, 3.24138652e+00, 1.24436937e-02, -1.41549553e+01,
-3.31912739e+00, -5.54411310e-03, -1.08317125e+01, 2.09684769e-02,
6.84396750e-02])
You can try the statsmodels library, combined with the explanation in this link, to fit polynomial models:
https://ostwalprasad.github.io/machine-learning/Polynomial-Regression-using-statsmodel.html
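For illustration, a minimal sketch along those lines, reusing the data.csv saved in the Simulated Annealing answer above; the feature expansion via sklearn's PolynomialFeatures is my own assumption, not necessarily what the linked post does exactly:

import pandas as pd
import statsmodels.api as sm
from sklearn.preprocessing import PolynomialFeatures

data = pd.read_csv("data.csv", sep=r"\s+", header=None)   # the toy data above
X, Y = data.iloc[:, :3].values, data.iloc[:, 3].values

# expand x -> (x_i, x_i**2, x_i*x_k); the model is then linear in beta
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
model = sm.OLS(Y, sm.add_constant(X_poly)).fit()
print(model.params)   # intercept followed by the polynomial coefficients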
After some trial and error, I finally came up with a solution. The problem can be seen as linear using a change of variables. I used scikit-learn to build the model, and after some tests on real cases it works really well.
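A minimal sketch of that change-of-variables idea, assuming scikit-learn (a reconstruction, since the original code was not posted; the toy arrays are placeholders):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
X = rng.random((100, 3))   # stand-in for X = (nsample, nx)
Y = rng.random((100, 2))   # stand-in for Y = (nsample, ny)

# PolynomialFeatures linearizes the problem; LinearRegression handles
# multiple outputs, so the ny separate regressions come for free
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, Y)
rmse = np.sqrt(np.mean((model.predict(X) - Y) ** 2))
print(rmse)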
I have a data frame with some quantitative columns and one qualitative column. I would like to use describe to compute stats grouped by the qualitative column. But I do not obtain the order I want for the levels. Here is an example:
import numpy as np
import pandas as pd

df = pd.DataFrame({k: np.random.random(10) for k in "ABC"})
df["qual"] = 5 * ["init"] + 5 * ["final"]
The DataFrame looks like:
A B C qual
0 0.298217 0.675818 0.076533 init
1 0.015442 0.264924 0.624483 init
2 0.096961 0.702419 0.027134 init
3 0.481312 0.910477 0.796395 init
4 0.166774 0.319054 0.645250 init
5 0.609148 0.697818 0.151092 final
6 0.715744 0.067429 0.761562 final
7 0.748201 0.803647 0.482738 final
8 0.098323 0.614257 0.232904 final
9 0.033003 0.590819 0.943126 final
Now I would like to group by the qual column and compute statistical descriptors using describe. I did the following:
ddf = df.groupby("qual").describe().transpose()
ddf.unstack(level=0)
And I got
qual final init
A B C A B C
count 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000
mean 0.440884 0.554794 0.514284 0.211741 0.574539 0.433959
std 0.347138 0.284931 0.338057 0.182946 0.274135 0.355515
min 0.033003 0.067429 0.151092 0.015442 0.264924 0.027134
25% 0.098323 0.590819 0.232904 0.096961 0.319054 0.076533
50% 0.609148 0.614257 0.482738 0.166774 0.675818 0.624483
75% 0.715744 0.697818 0.761562 0.298217 0.702419 0.645250
max 0.748201 0.803647 0.943126 0.481312 0.910477 0.796395
I am close to what I want, but I would like to swap and group the column index like this:
         A            B            C
qual  init final   init final   init final
Is there a way to do it?
Use columns.swaplevel and then sort_index by level=0 and axis='columns':
ddf = df.groupby('qual').describe().T.unstack(level=0)
ddf.columns = ddf.columns.swaplevel(0,1)
ddf = ddf.sort_index(level=0, axis='columns')
Or in one line using DataFrame.swaplevel instead of index.swaplevel:
ddf = ddf.swaplevel(0,1, axis=1).sort_index(level=0, axis='columns')
A B C
qual final init final init final init
count 5.00 5.00 5.00 5.00 5.00 5.00
mean 0.44 0.21 0.55 0.57 0.51 0.43
std 0.35 0.18 0.28 0.27 0.34 0.36
min 0.03 0.02 0.07 0.26 0.15 0.03
25% 0.10 0.10 0.59 0.32 0.23 0.08
50% 0.61 0.17 0.61 0.68 0.48 0.62
75% 0.72 0.30 0.70 0.70 0.76 0.65
max 0.75 0.48 0.80 0.91 0.94 0.80
Try ddf.stack().unstack(level=[0, 2]) in place of ddf.unstack(level=0); a minimal runnable sketch, reusing the df from the question:
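import numpy as np
import pandas as pd

df = pd.DataFrame({k: np.random.random(10) for k in "ABC"})
df["qual"] = 5 * ["init"] + 5 * ["final"]

ddf = df.groupby("qual").describe().transpose()
out = ddf.stack().unstack(level=[0, 2])   # columns become (A|B|C, qual) pairs
print(out)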
I'm trying to fit a set of data produced by an external simulation and stored in a vector, using the lmfit library.
Below is my code:
import numpy as np
import matplotlib.pyplot as plt
from lmfit import Model
from lmfit import Parameters
def DGauss3Par(x, I1, sigma1, sigma2):
    I2 = 2.63 - I1
    return (I1/np.sqrt(2*np.pi*sigma1))*np.exp(-(x*x)/(2*sigma1*sigma1)) + (I2/np.sqrt(2*np.pi*sigma2))*np.exp(-(x*x)/(2*sigma2*sigma2))
#TAKE DATA
xFull = []
yFull = []
fileTypex = np.dtype([('xFull', np.float64)])
fileTypey = np.dtype([('yFull', np.float64)])
fDatax = "xValue.dat"
fDatay = "yValue.dat"
xFull = np.loadtxt(fDatax, dtype=fileTypex)
yFull = np.loadtxt(fDatay, dtype=fileTypey)
xGauss = xFull[:]["xFull"]
yGauss = yFull[:]["yFull"]
#MODEL'S DEFINITION
gmodel = Model(DGauss3Par)
params = Parameters()
params.add('I1', value=1.66)
params.add('sigma1', value=1.04)
params.add('sigma2', value=1.2)
result3 = gmodel.fit(yGauss, x=xGauss, params=params)
#PLOTS
plt.plot(xGauss, result3.best_fit, 'y-')
plt.show()
When I run it, I get this error:
File "Overlap.py", line 133, in <module>
result3 = gmodel.fit(yGauss, x=xGauss, params=params)
ValueError: The input contains nan values
These are the values of the data contained in the vector xGauss (related to the x axis):
[-3.88 -3.28 -3.13 -3.08 -3.03 -2.98 -2.93 -2.88 -2.83 -2.78 -2.73 -2.68
-2.63 -2.58 -2.53 -2.48 -2.43 -2.38 -2.33 -2.28 -2.23 -2.18 -2.13 -2.08
-2.03 -1.98 -1.93 -1.88 -1.83 -1.78 -1.73 -1.68 -1.63 -1.58 -1.53 -1.48
-1.43 -1.38 -1.33 -1.28 -1.23 -1.18 -1.13 -1.08 -1.03 -0.98 -0.93 -0.88
-0.83 -0.78 -0.73 -0.68 -0.63 -0.58 -0.53 -0.48 -0.43 -0.38 -0.33 -0.28
-0.23 -0.18 -0.13 -0.08 -0.03 0.03 0.08 0.13 0.18 0.23 0.28 0.33
0.38 0.43 0.48 0.53 0.58 0.63 0.68 0.73 0.78 0.83 0.88 0.93
0.98 1.03 1.08 1.13 1.18 1.23 1.28 1.33 1.38 1.43 1.48 1.53
1.58 1.63 1.68 1.73 1.78 1.83 1.88 1.93 1.98 2.03 2.08 2.13
2.18 2.23 2.28 2.33 2.38 2.43 2.48 2.53 2.58 2.63 2.68 2.73
2.78 2.83 2.88 2.93 2.98 3.03 3.08 3.13 3.28 3.88]
And these ones the ones in the vector yGauss (related to y axis):
[0.00173977 0.00986279 0.01529543 0.0242624 0.0287456 0.03238484
0.03285927 0.03945234 0.04615091 0.05701618 0.0637672 0.07194268
0.07763934 0.08565687 0.09615262 0.1043281 0.11350606 0.1199406
0.1260062 0.14093328 0.15079665 0.16651464 0.18065023 0.1938894
0.2047541 0.21794024 0.22806706 0.23793043 0.25164404 0.2635118
0.28075974 0.29568682 0.30871501 0.3311846 0.34648062 0.36984661
0.38540666 0.40618835 0.4283945 0.45002014 0.48303911 0.50746062
0.53167057 0.5548792 0.57835128 0.60256181 0.62566436 0.65704847
0.68289386 0.71332794 0.73258027 0.769608 0.78769989 0.81407275
0.83358852 0.85210239 0.87109068 0.89456217 0.91618782 0.93760247
0.95680234 0.96919757 0.9783219 0.98486193 0.9931429 0.9931429
0.98486193 0.9783219 0.96919757 0.95680234 0.93760247 0.91618782
0.89456217 0.87109068 0.85210239 0.83358852 0.81407275 0.78769989
0.769608 0.73258027 0.71332794 0.68289386 0.65704847 0.62566436
0.60256181 0.57835128 0.5548792 0.53167057 0.50746062 0.48303911
0.45002014 0.4283945 0.40618835 0.38540666 0.36984661 0.34648062
0.3311846 0.30871501 0.29568682 0.28075974 0.2635118 0.25164404
0.23793043 0.22806706 0.21794024 0.2047541 0.1938894 0.18065023
0.16651464 0.15079665 0.14093328 0.1260062 0.1199406 0.11350606
0.1043281 0.09615262 0.08565687 0.07763934 0.07194268 0.0637672
0.05701618 0.04615091 0.03945234 0.03285927 0.03238484 0.0287456
0.0242624 0.01529543 0.00986279 0.00173977]
I've also tried printing the values returned by my function, to see if there really were some NaN values:
params = Parameters()
params.add('I1', value=1.66)
params.add('sigma1', value=1.04)
params.add('sigma2', value=1.2)
func = DGauss3Par(xGauss, 1.66, 1.04, 1.2)   # the initial values from params above
print(func)
but what I obtained is:
[0.04835225 0.06938855 0.07735839 0.08040181 0.08366964 0.08718237
0.09096169 0.09503048 0.0994128 0.10413374 0.10921938 0.11469669
0.12059333 0.12693754 0.13375795 0.14108333 0.14894236 0.15736337
0.16637406 0.17600115 0.18627003 0.19720444 0.20882607 0.22115413
0.23420498 0.24799173 0.26252377 0.27780639 0.29384037 0.3106216
0.32814069 0.34638266 0.3653266 0.38494543 0.40520569 0.42606735
0.44748374 0.46940149 0.49176057 0.51449442 0.5375301 0.56078857
0.58418507 0.60762948 0.63102687 0.65427809 0.6772804 0.69992818
0.72211377 0.74372824 0.76466232 0.78480729 0.80405595 0.82230355
0.83944875 0.85539458 0.87004937 0.88332762 0.89515085 0.90544838
0.91415806 0.92122688 0.92661155 0.93027889 0.93220625 0.93220625
0.93027889 0.92661155 0.92122688 0.91415806 0.90544838 0.89515085
0.88332762 0.87004937 0.85539458 0.83944875 0.82230355 0.80405595
0.78480729 0.76466232 0.74372824 0.72211377 0.69992818 0.6772804
0.65427809 0.63102687 0.60762948 0.58418507 0.56078857 0.5375301
0.51449442 0.49176057 0.46940149 0.44748374 0.42606735 0.40520569
0.38494543 0.3653266 0.34638266 0.32814069 0.3106216 0.29384037
0.27780639 0.26252377 0.24799173 0.23420498 0.22115413 0.20882607
0.19720444 0.18627003 0.17600115 0.16637406 0.15736337 0.14894236
0.14108333 0.13375795 0.12693754 0.12059333 0.11469669 0.10921938
0.10413374 0.0994128 0.09503048 0.09096169 0.08718237 0.08366964
0.08040181 0.07735839 0.06938855 0.04835225]
So it doesn't seem that there are NaN values, and I don't understand why it raises that error.
Could anyone help me, please? Thanks!
If you add a print call to your fit function, printing out sigma1 and sigma2, you'll find that DGauss3Par is evaluated a few times before the error occurs, and that both sigma variables have a negative value by the time it does. Taking the square root of a negative value produces, of course, a NaN.
You should add a min bound or similar to your sigma1 and sigma2 parameters to prevent this. Using min=0.0 as an additional argument to params.add(...) will result in a good fit:
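A minimal sketch of the fix, reusing gmodel, xGauss, and yGauss from the question:

params = Parameters()
params.add('I1', value=1.66)
params.add('sigma1', value=1.04, min=0.0)   # forbid negative widths so
params.add('sigma2', value=1.2, min=0.0)    # np.sqrt never sees a negative
result3 = gmodel.fit(yGauss, x=xGauss, params=params)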
Be aware that for some analyses, setting explicit bounds on your fitting parameters may invalidate the analysis. For most cases you'll be fine, but for some you'll need to check whether the fitting parameters should be allowed to vary from negative infinity to positive infinity or may be bounded.
I am new to Python and I am trying to build a function to run some statistics on a data set. The data is in Excel format and contains 7 rows, with the first row as a header. I know what a function is and how it should be built; nevertheless, I can't figure out how to build this function.
This is the function:
def st_dev(benchmark, factor):
    benchmark = mkt_ret
    factor = smb
    statistics = st.stdev(benchmark, factor)
    return statistics
print(st_dev)
And this is the result:
Mkt-RF SMB HML RMW CMA RF
196307 -0.39 -0.46 -0.81 0.72 -1.16 0.27
196308 5.07 -0.81 1.65 0.42 -0.4 0.25
196309 -1.57 -0.48 0.19 -0.8 0.23 0.27
196310 2.53 -1.29 -0.09 2.75 -2.26 0.29
196311 -0.85 -0.85 1.71 -0.34 2.22 0.27
4.38
<function st_dev at 0x0000000002D92F28>
Process finished with exit code 0
The full code can be viewed here.
I tried several versions of the function; some error messages told me that I cannot convert 'Series' to numerator/denominator.
I am running python 3.7
Thank you for your help.
Alex