I am trying to use curve_fit to fit some data. It is working well; I would just like to improve the fit with additional constraints that match physical assumptions (such as mechanical efficiency cannot be greater than 100%, etc.).
y_data = [0.90, 0.90, 0.90, 0.90, 0.90, 0.90, 0.90, 1.30, 1.30, 1.30, 1.30, 1.20, 1.65, 1.65, 1.65, 1.65, 1.65, 1.65, 1.80, 1.80, 1.80, 1.80, 1.80, 1.80, 1.80, 1.80, 1.80, 3.50, 6.60, 6.60, 6.70, 6.70, 6.70, 6.70, 6.70, 8.50, 12.70]
x_data = [0.38, 0.38, 0.38, 0.38, 0.38, 0.38, 0.38, 0.38, 0.38, 0.38, 0.38, 0.46, 0.53, 0.53, 0.53, 0.53, 0.53, 0.53, 0.53, 0.53, 0.53, 0.53, 0.53, 0.53, 0.53, 0.53, 0.53, 0.53, 1.02, 1.02, 1.02, 1.02, 1.02, 1.02, 1.02, 1.02, 1.02]
def poly2(x, a, b, c): return a*x**2 + b*x + c
def poly3(x, a, b, c, d): return a*x**3 + b*x**2 + c*x + d
pars, cov = curve_fit(poly2, x_data, y_data, bounds=bounds)
But I would additionally like to specify constraints on relations between the parameters, e.g.
b**2 - 4*a*c > 0  # for poly2
b**2 - 3*a*c == 0  # for poly3
to ensure that the fit has a horizontal inflection point.
Is there a way to achieve this?
Edit: I found this, which may help once I investigate it: How do I put a constraint on SciPy curve fit?
How would this be done using lmfit as suggested?
So I believe I have solved this, based on 9dogs' comment, using lmfit.
relevant documentation here:
https://lmfit.github.io/lmfit-py/constraints.html
and a helpful tutorial here:
http://blog.danallan.com/projects/2013/model/
For my function poly3 this seems to work to enforce a horizontal or positive inflection.
from lmfit import Parameters, Model
def poly3(x, a, b, c, d): return a*x**3 + b*x**2 + c*x + d
model = Model(poly3, independent_vars=['x'])
Apologies for the terrible maths: the cubic discriminant is given here (https://brilliant.org/wiki/cubic-discriminant/) as b**2*c**2 - 4*a*c**3 - 4*b**3*d - 27*a**2*d**2 + 18*a*b*c*d.
params = Parameters()
params.add('a', value=1, min=0, vary=True)
params.add('b', value=1, vary=True)
params.add('c', value=1, vary=True)
params.add('d', value=1, vary=True)
params.add('discr', value=0, vary=False, expr='b**2*c**2 - 4*a*c**3 - 4*b**3*d - 27*a**2*d**2 + 18*a*b*c*d')
result = model.fit(y_data, x=x_data, params=params) # do the work
pars = [] # list that will contain the optimized parameters for analysis
# create a parameters list for use in the rest of code, this is a stopgap until I refactor the rest of my code
pars.append(result.values['a'])
pars.append(result.values['b'])
pars.append(result.values['c'])
pars.append(result.values['d'])
## rest of code such as plotting
If there are questions I will expand the example further.
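For completeness, a minimal sketch of how the fitted curve might be plotted (my addition, assuming the x_data, y_data and poly3 from the question and the pars list built above):
import numpy as np
import matplotlib.pyplot as plt
# evaluate the fitted cubic on a fine grid using the optimized parameters
x_fine = np.linspace(min(x_data), max(x_data), 200)
y_fine = poly3(x_fine, *pars)
plt.plot(x_data, y_data, 'o', label='data')
plt.plot(x_fine, y_fine, '-', label='constrained cubic fit')
plt.legend()
plt.show()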
Related
I'm currently doing my numerical analysis homework. I am using Python to analyze the influence of different parameter values (w in the code) on the backward error of an algorithm. I want to use matplotlib.pyplot to plot a scatter to show the result, but the scatter doesn't look like what I want.
As you can see from the figure, the values on the y-axis are not ascending from bottom to top; they are distributed randomly, and all the points seem to lie on the same line. I've tried a lot of methods to fix it but failed.
Here's the problematic piece of code and the data file "SOR2".
import matplotlib.pyplot as plt
import numpy as np
# read SOR2
SOR2 = open("SOR2", 'r')
w = []
e = []
for line in SOR2:
    data = line.strip().split()
    w.append(data[0])
    e.append(data[1])
SOR2.close()
# plot scatter
plt.xlabel("w")
plt.ylabel("backward error")
plt.scatter(w, e)
plt.show()
The data in file "SOR2", the left column is w, and the right column is backward error:
0.50 1.05549
0.51 1.01085
0.52 0.96795
0.53 0.92669
0.54 0.88701
0.55 0.84883
0.56 0.81210
0.57 0.77676
0.58 0.74274
0.59 0.70999
0.60 0.67847
0.61 0.64811
0.62 0.61889
0.63 0.59075
0.64 0.56366
0.65 0.53758
0.66 0.51247
0.67 0.48829
0.68 0.46502
0.69 0.44263
0.70 0.42107
0.71 0.40034
0.72 0.38039
0.73 0.36120
0.74 0.34276
0.75 0.32503
0.76 0.30799
0.77 0.29163
0.78 0.27592
0.79 0.26084
0.80 0.24638
0.81 0.23251
0.82 0.21921
0.83 0.20648
0.84 0.19429
0.85 0.18263
0.86 0.17148
0.87 0.16083
0.88 0.15067
0.89 0.14097
0.90 0.13173
0.91 0.12293
0.92 0.11457
0.93 0.10662
0.94 0.09908
0.95 0.09193
0.96 0.08516
0.97 0.07876
0.98 0.07272
0.99 0.06702
1.00 0.06166
1.01 0.05663
1.02 0.05190
1.03 0.04748
1.04 0.04335
1.05 0.03950
1.06 0.03599
1.07 0.03276
1.08 0.02977
1.09 0.02699
1.10 0.02442
1.11 0.02208
1.12 0.01993
1.13 0.01794
1.14 0.01609
1.15 0.01438
1.16 0.01280
1.17 0.01139
1.18 0.01009
1.19 0.00890
1.20 0.00791
1.21 0.00706
1.22 0.00630
1.23 0.00560
1.24 0.00498
1.25 0.00441
1.26 0.00402
1.27 0.00384
1.28 0.00434
1.29 0.00514
1.30 0.00610
1.31 0.00723
1.32 0.00856
1.33 0.01013
1.34 0.01196
1.35 0.01408
1.36 0.01655
1.37 0.01940
1.38 0.02268
1.39 0.02645
1.40 0.03077
1.41 0.03571
1.42 0.04133
1.43 0.04773
1.44 0.05498
1.45 0.06319
1.46 0.07246
1.47 0.08291
1.48 0.09466
1.49 0.10786
And the result looks like this (the resulting scatter plot image is not reproduced here).
As krm commented, the data needs to be converted to float:
w.append(float(data[0]))
e.append(float(data[1]))
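Putting it together, the corrected reading loop would look like this (same structure as the original code):
w = []
e = []
with open("SOR2") as SOR2:
    for line in SOR2:
        data = line.strip().split()
        w.append(float(data[0]))
        e.append(float(data[1]))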
Alternatively you can use pandas to simplify all the parsing and plotting down to 2 lines with pandas.read_fwf() and DataFrame.plot.scatter():
import pandas as pd
df = pd.read_fwf('SOR2', header=None, names=['w', 'e'])
df.plot.scatter(x='w', y='e', ylabel='backward error')
I am dealing with multivariate regression problems.
My dataset is something like X = (nsample, nx) and Y = (nsample, ny).
nx and ny may vary from dataset to dataset depending on the case to study, so they should be kept general in the code.
I would like to determine the coefficients for the multivariate polynomial regression minimizing the root mean square error.
I thought of splitting the problem into ny separate regressions, so for each of them my dataset is X = (nsample, nx) and Y = (nsample, 1). So, for each dependent variable (Uj) the second-order polynomial has the following form:
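(The equation was included as an image in the original post; based on the code below it is essentially
U_j = \beta_0 + \sum_{i=1}^{n_x} \beta_i x_i + \sum_{i=1}^{n_x} \sum_{k \ge i} \beta_{ik} x_i x_k
i.e. a bias term, the linear terms, and the squared and cross terms.)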
I coded the function in python as:
def func(x, nx, pars0, pars1, pars2):
    y = pars0                      # pars0 = bias
    for i in range(nx):
        y = y + pars1[i]*x[i]      # pars1 linear coeff (beta_i in the equation)
        for j in range(nx):
            if j < i:
                continue
            y = y + pars2[i, j]*x[i]*x[j]
            # diag of pars2 = coeff of x^2 (beta_ii in the equation)
            # upper triangle of pars2 = coeff of x_i*x_k (beta_ik in the equation)
    return y
and the root mean square error as:
def resid(nsample, nx, pars0, pars1, pars2, x, y):
    res = 0.0
    for i in range(nsample):
        y_pred = func(x[i], nx, pars0, pars1, pars2)
        res = res + (y_pred - y[i])**2
    res = res/nsample
    res = res**0.5
    return res
To determine the coefficients I thought of using scipy.optimize.minimize, but it does not work (example_1, example_2).
Any ideas or advices? Should I use sklearn?
EDIT: Toy test data with nx = 3, ny = 1 (the first three columns are X, the last column is Y):
0.20 -0.02 0.20 1.0229781
0.20 -0.02 0.40 1.0218807
0.20 -0.02 0.60 1.0220439
0.20 -0.02 0.80 1.0227083
0.20 -0.02 1.00 1.0237960
0.20 -0.02 1.20 1.0255770
0.20 -0.02 1.40 1.0284888
0.20 -0.06 0.20 1.0123552
0.24 -0.02 1.40 1.0295350
0.24 -0.06 0.20 1.0125935
0.24 -0.06 0.40 1.0195798
0.24 -0.06 0.60 1.0124632
0.24 -0.06 0.80 1.0131748
0.24 -0.06 1.00 1.0141751
0.24 -0.06 1.20 1.0153533
0.24 -0.06 1.40 1.0170036
0.24 -0.10 0.20 1.0026915
0.24 -0.10 0.40 1.0058125
0.24 -0.10 0.60 1.0055921
0.24 -0.10 0.80 1.0057868
0.24 -0.10 1.00 1.0014004
0.24 -0.10 1.20 1.0026257
0.24 -0.10 1.40 1.0024578
0.30 -0.18 0.60 0.9748765
0.30 -0.18 0.80 0.9753220
0.30 -0.18 1.00 0.9740970
0.30 -0.18 1.20 0.9727272
0.30 -0.18 1.40 0.9732258
0.30 -0.20 0.20 0.9722360
0.30 -0.20 0.40 0.9687567
0.30 -0.20 0.60 0.9676569
0.30 -0.20 0.80 0.9672319
0.30 -0.20 1.00 0.9682354
0.30 -0.20 1.20 0.9674461
0.30 -0.20 1.40 0.9673747
0.36 -0.02 0.20 1.0272033
0.36 -0.02 0.40 1.0265790
0.36 -0.02 0.60 1.0271688
0.36 -0.02 0.80 1.0277286
0.36 -0.02 1.00 1.0285388
0.36 -0.02 1.20 1.0295619
0.36 -0.02 1.40 1.0310734
0.36 -0.06 0.20 1.0159603
0.36 -0.06 0.40 1.0159753
0.36 -0.06 0.60 1.0161890
0.36 -0.06 0.80 1.0153346
0.36 -0.06 1.00 1.0159790
0.36 -0.06 1.20 1.0167520
0.36 -0.06 1.40 1.0176916
0.36 -0.10 0.20 1.0048287
0.36 -0.10 0.40 1.0034699
0.36 -0.10 0.60 1.0032798
0.36 -0.10 0.80 1.0037224
0.36 -0.10 1.00 1.0059301
0.36 -0.10 1.20 1.0047114
0.36 -0.10 1.40 1.0041287
0.36 -0.14 0.20 0.9926268
0.40 -0.08 0.80 1.0089013
0.40 -0.08 1.20 1.0096265
0.40 -0.08 1.40 1.0103305
0.40 -0.10 0.20 1.0045464
0.40 -0.10 0.40 1.0041031
0.40 -0.10 0.60 1.0035650
0.40 -0.10 0.80 1.0034553
0.40 -0.10 1.00 1.0034699
0.40 -0.10 1.20 1.0030276
0.40 -0.10 1.40 1.0035284
0.40 -0.10 1.60 1.0042166
0.40 -0.14 0.20 0.9924336
0.40 -0.14 0.40 0.9914971
0.40 -0.14 0.60 0.9910082
0.40 -0.14 0.80 0.9903772
0.40 -0.14 1.00 0.9900816
Minimizing error is a huge, complex problem. As such, a lot of very clever people have thought up a lot of cool solutions. Here are a few:
(Out of all of them, I think Bayesian optimization with sklearn might be a good choice for your use case, though I've never used it.)
Random approaches:
genetic algorithms: formats your problem like chromosomes in a genome and "breeds" an optimal solution (a personal favorite of mine)
simulated annealing: formats your problem like hot metal being annealed, which attempts to move to a stable state while losing heat
random search: better than it sounds. randomly tests a variety of input variables.
Grid Search: simple to implement, but often less effective than methods which employ true randomness (it duplicates exploration along particular axes of interest and often wastes computational resources)
A lot of these come up in hyperparameter optimization for ML models.
More Prescriptive Approaches:
Gradient Descent: uses the gradient calculated in a differentiable function to step toward local minima
DeepAR: uses Bayesian optimization, combined with random search, to reduce loss in hyperparameter tuning. While I believe this is only available on AWS, it looks like sklearn has an implementation of Bayesian optimization
scipy.optimize.minimize: I know you're already using this, but there are 15 different algorithms that can be used by changing the method flag (a short sketch follows after this list)
The rub
While error minimization is conceptually simple, in practice complex error topologies in high-dimensional spaces can be very difficult to traverse efficiently. It touches on local and global extrema, the explore/exploit tradeoff, and our mathematical understanding of what computational complexity even is. Often, good error reduction is accomplished through a combination of a thorough understanding of the problem and experimentation with multiple algorithms and hyperparameters. In ML, this is often referred to as hyperparameter tuning, and is a sort of "meta" error-reduction step, if you will.
note: feel free to recommend more optimization methods, I'll add them to the list.
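As a concrete illustration of the method flag mentioned in the scipy.optimize.minimize item above, here is a minimal sketch (placeholder data and a flat coefficient vector, not the poster's exact model):
import numpy as np
from scipy.optimize import minimize

def rmse(beta, X, Y):
    # beta packs all polynomial coefficients into one flat vector
    return np.sqrt(np.mean((X @ beta - Y) ** 2))

X = np.random.rand(50, 9)   # placeholder design matrix of expanded polynomial terms
Y = np.random.rand(50)      # placeholder targets
beta0 = np.zeros(9)

for method in ["Nelder-Mead", "Powell", "BFGS"]:
    res = minimize(rmse, beta0, args=(X, Y), method=method)
    print(method, res.fun)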
I have an example using Simulated Annealing, as mentioned in the nice list in this thread.
First, I need to load the data and define the objective function. I saved your data in data.csv and loaded it with
import pandas as pd
import numpy as np

data = pd.read_csv("../data.csv", sep=" ", header=None, engine='python')
And fetch your values with
X = data[[0, 1, 2]].values
Y = data[3].values
I define your poly function with
from itertools import combinations
def poly_function(X, beta):
    X_dimension = X.shape[1]
    i, j = zip(*list(combinations(range(X_dimension), 2)))
    X_cross = X[:, i] * X[:, j]
    X_expanded = np.concatenate([X, X**2, X_cross], axis=1)
    assert X_expanded.shape[1] == beta.shape[0], "Expect beta to be of size {}".format(X_expanded.shape[1])
    return np.matmul(X_expanded, beta)
For Simulated Annealing we simply need an objective
def obj(beta, X=X, Y=Y):
    Y_hat = poly_function(X, beta)
    BOOSTER = 10**5
    return BOOSTER * np.mean((Y - Y_hat)**2)**.5
and some proposals
def small_delta(beta):
    new_beta = beta.copy()
    random_index = np.random.randint(0, new_beta.shape[0])
    new_beta[random_index] += (np.random.random() - .5) * .01
    return new_beta

def large_delta(beta):
    new_beta = beta.copy()
    random_index = np.random.randint(0, new_beta.shape[0])
    new_beta[random_index] += np.random.random() - .5
    return new_beta
And a random start
def random_beta():
    return np.random.random(size=9)
And SA with
import frigidum
local_opt = frigidum.sa(random_start=random_beta,
                        neighbours=[small_delta, large_delta],
                        objective_function=obj,
                        T_start=10**2,
                        T_stop=10**-12,
                        repeats=10**3,
                        copy_state=frigidum.annealing.copy)
The RMSE I found with your data was around 0.026254 with beta
array([ 7.73168440e+00, 2.93929578e+00, 4.10133180e-02, -1.37266444e+01,
-3.43978686e+00, -1.12816177e-02, -1.00262307e+01, -3.12327590e-02,
9.07369588e-02])
where you need to know that it is built up as the coefficients of (X1, X2, X3, X1**2, X2**2, X3**2, X1*X2, X1*X3, X2*X3).
A longer run with more repeats gives me an error of 0.026150 with beta
array([ 7.89212770e+00, 3.24138652e+00, 1.24436937e-02, -1.41549553e+01,
-3.31912739e+00, -5.54411310e-03, -1.08317125e+01, 2.09684769e-02,
6.84396750e-02])
You can try the statsmodels library combined with the explanation from this link to fit polynomial models.
https://ostwalprasad.github.io/machine-learning/Polynomial-Regression-using-statsmodel.html
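As an illustration of that approach (my own minimal sketch with placeholder data, not the linked tutorial's code), you could expand the quadratic terms by hand and fit with statsmodels OLS:
import numpy as np
import statsmodels.api as sm
from itertools import combinations_with_replacement

X = np.random.rand(100, 3)   # placeholder (nsample, nx) data
Y = np.random.rand(100)      # placeholder targets for one dependent variable

# build the x_i*x_k terms (including x_i**2) by hand
pairs = list(combinations_with_replacement(range(X.shape[1]), 2))
X_quad = np.column_stack([X[:, i] * X[:, k] for i, k in pairs])
X_design = sm.add_constant(np.hstack([X, X_quad]))

result = sm.OLS(Y, X_design).fit()
print(result.summary())   # coefficients correspond to [bias, beta_i, beta_ik]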
After some trial and error, I finally came up with a solution. The problem can be seen as linear after a change of variables. I used scikit-learn to build the model, and after some tests on real cases it works really well.
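The poster did not share their final code; a minimal sketch of the change-of-variables idea with scikit-learn (placeholder data; PolynomialFeatures generates the x_i, x_i**2 and x_i*x_k columns so an ordinary linear regression can fit the coefficients) might look like:
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

X = np.random.rand(100, 3)   # placeholder (nsample, nx) data
Y = np.random.rand(100)      # placeholder targets for one dependent variable

model = make_pipeline(PolynomialFeatures(degree=2),
                      LinearRegression(fit_intercept=False))
model.fit(X, Y)
print(model.named_steps['linearregression'].coef_)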
I'm trying to fit a set of data, produced by an external simulation and stored in a vector, with the lmfit library.
Below is my code:
import numpy as np
import matplotlib.pyplot as plt
from lmfit import Model
from lmfit import Parameters
def DGauss3Par(x, I1, sigma1, sigma2):
    I2 = 2.63 - I1
    return (I1/np.sqrt(2*np.pi*sigma1))*np.exp(-(x*x)/(2*sigma1*sigma1)) + (I2/np.sqrt(2*np.pi*sigma2))*np.exp(-(x*x)/(2*sigma2*sigma2))
#TAKE DATA
xFull = []
yFull = []
fileTypex = np.dtype([('xFull', np.float64)])
fileTypey = np.dtype([('yFull', np.float64)])
fDatax = "xValue.dat"
fDatay = "yValue.dat"
xFull = np.loadtxt(fDatax, dtype=fileTypex)
yFull = np.loadtxt(fDatay, dtype=fileTypey)
xGauss = xFull[:]["xFull"]
yGauss = yFull[:]["yFull"]
#MODEL'S DEFINITION
gmodel = Model(DGauss3Par)
params = Parameters()
params.add('I1', value=1.66)
params.add('sigma1', value=1.04)
params.add('sigma2', value=1.2)
result3 = gmodel.fit(yGauss, x=xGauss, params=params)
#PLOTS
plt.plot(xGauss, result3.best_fit, 'y-')
plt.show()
When I run it, I get this error:
File "Overlap.py", line 133, in <module>
result3 = gmodel.fit(yGauss, x=xGauss, params=params)
ValueError: The input contains nan values
These are the values of the data contained in the vector xGauss (related to the x axis):
[-3.88 -3.28 -3.13 -3.08 -3.03 -2.98 -2.93 -2.88 -2.83 -2.78 -2.73 -2.68
-2.63 -2.58 -2.53 -2.48 -2.43 -2.38 -2.33 -2.28 -2.23 -2.18 -2.13 -2.08
-2.03 -1.98 -1.93 -1.88 -1.83 -1.78 -1.73 -1.68 -1.63 -1.58 -1.53 -1.48
-1.43 -1.38 -1.33 -1.28 -1.23 -1.18 -1.13 -1.08 -1.03 -0.98 -0.93 -0.88
-0.83 -0.78 -0.73 -0.68 -0.63 -0.58 -0.53 -0.48 -0.43 -0.38 -0.33 -0.28
-0.23 -0.18 -0.13 -0.08 -0.03 0.03 0.08 0.13 0.18 0.23 0.28 0.33
0.38 0.43 0.48 0.53 0.58 0.63 0.68 0.73 0.78 0.83 0.88 0.93
0.98 1.03 1.08 1.13 1.18 1.23 1.28 1.33 1.38 1.43 1.48 1.53
1.58 1.63 1.68 1.73 1.78 1.83 1.88 1.93 1.98 2.03 2.08 2.13
2.18 2.23 2.28 2.33 2.38 2.43 2.48 2.53 2.58 2.63 2.68 2.73
2.78 2.83 2.88 2.93 2.98 3.03 3.08 3.13 3.28 3.88]
And these are the values in the vector yGauss (related to the y axis):
[0.00173977 0.00986279 0.01529543 0.0242624 0.0287456 0.03238484
0.03285927 0.03945234 0.04615091 0.05701618 0.0637672 0.07194268
0.07763934 0.08565687 0.09615262 0.1043281 0.11350606 0.1199406
0.1260062 0.14093328 0.15079665 0.16651464 0.18065023 0.1938894
0.2047541 0.21794024 0.22806706 0.23793043 0.25164404 0.2635118
0.28075974 0.29568682 0.30871501 0.3311846 0.34648062 0.36984661
0.38540666 0.40618835 0.4283945 0.45002014 0.48303911 0.50746062
0.53167057 0.5548792 0.57835128 0.60256181 0.62566436 0.65704847
0.68289386 0.71332794 0.73258027 0.769608 0.78769989 0.81407275
0.83358852 0.85210239 0.87109068 0.89456217 0.91618782 0.93760247
0.95680234 0.96919757 0.9783219 0.98486193 0.9931429 0.9931429
0.98486193 0.9783219 0.96919757 0.95680234 0.93760247 0.91618782
0.89456217 0.87109068 0.85210239 0.83358852 0.81407275 0.78769989
0.769608 0.73258027 0.71332794 0.68289386 0.65704847 0.62566436
0.60256181 0.57835128 0.5548792 0.53167057 0.50746062 0.48303911
0.45002014 0.4283945 0.40618835 0.38540666 0.36984661 0.34648062
0.3311846 0.30871501 0.29568682 0.28075974 0.2635118 0.25164404
0.23793043 0.22806706 0.21794024 0.2047541 0.1938894 0.18065023
0.16651464 0.15079665 0.14093328 0.1260062 0.1199406 0.11350606
0.1043281 0.09615262 0.08565687 0.07763934 0.07194268 0.0637672
0.05701618 0.04615091 0.03945234 0.03285927 0.03238484 0.0287456
0.0242624 0.01529543 0.00986279 0.00173977]
I've also tried to print the values returned by my function, to see if there really were some NaN values:
params = Parameters()
params.add('I1', value=1.66)
params.add('sigma1', value=1.04)
params.add('sigma2', value=1.2)
func = DGauss3Par(xGauss, I1=1.66, sigma1=1.04, sigma2=1.2)
print(func)
but what I obtained is:
[0.04835225 0.06938855 0.07735839 0.08040181 0.08366964 0.08718237
0.09096169 0.09503048 0.0994128 0.10413374 0.10921938 0.11469669
0.12059333 0.12693754 0.13375795 0.14108333 0.14894236 0.15736337
0.16637406 0.17600115 0.18627003 0.19720444 0.20882607 0.22115413
0.23420498 0.24799173 0.26252377 0.27780639 0.29384037 0.3106216
0.32814069 0.34638266 0.3653266 0.38494543 0.40520569 0.42606735
0.44748374 0.46940149 0.49176057 0.51449442 0.5375301 0.56078857
0.58418507 0.60762948 0.63102687 0.65427809 0.6772804 0.69992818
0.72211377 0.74372824 0.76466232 0.78480729 0.80405595 0.82230355
0.83944875 0.85539458 0.87004937 0.88332762 0.89515085 0.90544838
0.91415806 0.92122688 0.92661155 0.93027889 0.93220625 0.93220625
0.93027889 0.92661155 0.92122688 0.91415806 0.90544838 0.89515085
0.88332762 0.87004937 0.85539458 0.83944875 0.82230355 0.80405595
0.78480729 0.76466232 0.74372824 0.72211377 0.69992818 0.6772804
0.65427809 0.63102687 0.60762948 0.58418507 0.56078857 0.5375301
0.51449442 0.49176057 0.46940149 0.44748374 0.42606735 0.40520569
0.38494543 0.3653266 0.34638266 0.32814069 0.3106216 0.29384037
0.27780639 0.26252377 0.24799173 0.23420498 0.22115413 0.20882607
0.19720444 0.18627003 0.17600115 0.16637406 0.15736337 0.14894236
0.14108333 0.13375795 0.12693754 0.12059333 0.11469669 0.10921938
0.10413374 0.0994128 0.09503048 0.09096169 0.08718237 0.08366964
0.08040181 0.07735839 0.06938855 0.04835225]
So it doesn't seem that there are NaN values, and I don't understand why it returns that error.
Could anyone help me, please? Thanks!
If you add a print statement to your fit function, printing out sigma1 and sigma2, you'll find that DGauss3Par is already evaluated a few times before the error occurs.
Both sigma variables have a negative value at the time the error occurs.
Taking the square root of a negative value causes, of course, a NaN.
You should add a min bound or similar to your sigma1 and sigma2 parameters to prevent this. Using min=0.0 as an additional argument to params.add(...) will result in a good fit.
Be aware that for some analyses, setting explicit bounds to your fitting parameters may make these analyses invalid. For most cases, you'll be fine, but for some cases, you'll need to check whether the fitting parameters should be allowed to vary from negative infinity to positive infinity, or are allowed to be bounded.
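Concretely, the bounded parameter setup could look like this (a short sketch reusing the starting values from the question):
params = Parameters()
params.add('I1', value=1.66)
params.add('sigma1', value=1.04, min=0.0)
params.add('sigma2', value=1.2, min=0.0)
result3 = gmodel.fit(yGauss, x=xGauss, params=params)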
I am new to Python and I am trying to build a function to run some statistics on a data set. The data is in Excel format and contains 7 columns, with the first row holding the column names. I know what a function is and how it should be built; nevertheless, I can't figure out how to build this function.
This is the function:
def st_dev(benchmark, factor):
    benchmark = mkt_ret
    factor = smb
    statistics = st.stdev(benchmark, factor)
    return statistics

print(st_dev)
And this is the result:
Mkt-RF SMB HML RMW CMA RF
196307 -0.39 -0.46 -0.81 0.72 -1.16 0.27
196308 5.07 -0.81 1.65 0.42 -0.4 0.25
196309 -1.57 -0.48 0.19 -0.8 0.23 0.27
196310 2.53 -1.29 -0.09 2.75 -2.26 0.29
196311 -0.85 -0.85 1.71 -0.34 2.22 0.27
4.38
<function st_dev at 0x0000000002D92F28>
Process finished with exit code 0
The full code can be viewed here.
I tried writing the function in several ways; some error messages told me that I cannot convert 'Series' to numerator/denominator.
I am running Python 3.7.
Thank you for your help.
Alex
Consider the simple process of reading a data file with some non-valid entries. This is my test.dat file:
16 1035.22 1041.09 24.54 0.30 1.39 0.30 1.80 0.30 2.26 0.30 1.14 0.30 0.28 0.30 0.2884
127 824.57 1105.52 25.02 0.29 0.87 0.29 1.30 0.29 2.12 0.29 0.66 0.29 0.10 0.29 0.2986
182 1015.83 904.93 INDEF 0.28 1.80 0.28 1.64 0.28 2.38 0.28 1.04 0.28 0.06 0.28 0.3271
185 1019.15 1155.09 24.31 0.28 1.40 0.28 1.78 0.28 2.10 0.28 0.87 0.28 0.35 0.28 0.3290
192 1024.80 1045.57 24.27 0.27 1.24 0.27 2.01 0.27 2.40 0.27 0.90 0.27 0.09 0.27 0.3328
197 1035.99 876.04 24.10 0.27 1.23 0.27 1.52 0.27 2.59 0.27 0.45 0.27 0.25 0.27 0.3357
198 1110.80 1087.97 24.53 0.27 1.49 0.27 1.71 0.27 2.33 0.27 0.22 0.27 0.00 0.27 0.3362
1103 1168.39 1065.97 24.35 0.27 1.28 0.27 1.29 0.27 2.68 0.27 0.43 0.27 0.26 0.27 0.3388
And this is the code to read it, and replace the "bad" values (INDEF) with a float (99.999)
import numpy as np
from astropy.io import ascii
data = ascii.read("test.dat", fill_values=[('INDEF', '0')])
data = data.filled(99.999)
This works just fine, but if I instead try to replace the bad values with a np.nan (i.e., I use the line data = data.filled(np.nan)) I get:
ValueError: cannot convert float NaN to integer
Why is this, and how can I get around it?
As noted, the issue is that the numpy MaskedArray.filled() method seems to try converting the fill value to the appropriate type before checking whether there is actually anything to fill. Since the table in the example has an int column, this fails within numpy (and astropy.Table is just calling the filled() method on each column).
This should work:
In [44]: def fill_cols(tbl, fill=np.nan, kind='f'):
...: """
...: In-place fill of ``tbl`` columns which have dtype ``kind``
...: with ``fill`` value.
...: """
...: for col in tbl.itercols():
...: if col.dtype.kind == kind:
...: col[...] = col.filled(fill)
...:
In [45]: from astropy.table.table_helpers import simple_table
    ...: t = simple_table(masked=True)
In [46]: t
Out[46]:
<Table masked=True length=3>
a b c
int64 float64 str1
----- ------- ----
-- 1.0 c
2 2.0 --
3 -- e
In [47]: fill_cols(t)
In [48]: t
Out[48]:
<Table masked=True length=3>
a b c
int64 float64 str1
----- ------- ----
-- 1.0 c
2 2.0 --
3 nan e
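Applied to the table from the question, that would look something like this (a sketch reusing the fill_cols helper above):
from astropy.io import ascii

data = ascii.read("test.dat", fill_values=[('INDEF', '0')])
fill_cols(data)   # fills the masked entries of the float columns with nan, in place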
I don't think it's primarily a numpy problem, as it works with individual columns:
>>> data['col4'].filled(np.nan)
<Column name='col4' dtype='float64' length=8>
24.54
25.02
nan
24.31
24.27
24.1
24.53
24.35
but you still can't construct a Table from this -
Table([data[n].filled(np.nan) for n in data.colnames])
raises the same error in np.ma.core.
You can explicitly set
data['col4'] = data['col4'].filled(np.nan)
but this apparently makes the table lose its .filled() method...
I am not that familiar with masked arrays and tables, but since you've already filed a related issue on GitHub, you might want to add this problem there.
This is happening fairly deep in numpy, in numpy.ma.filled; fill values basically have to be scalars.
A messy solution that fills with nan's and still returns a table could look like:
import numpy as np
from astropy.io import ascii
from astropy.table import Table
def fill_with_nan(t):
    arr = t.as_array()
    arr_list = arr.tolist()
    arr = np.array(arr_list)
    arr[np.equal(arr, None)] = np.nan
    arr = np.array(arr.tolist())
    return Table(arr)
data = ascii.read("test.dat", fill_values=[('INDEF', '0')])
data = fill_with_nan(data)
Cut out the middleman? fill_values=[('INDEF', np.nan)] seems to work.