Related
I am testing the scipy.optimize function curve_fit(). I am testing on a Quadratic function, and I have assigned the x and y data manually for this question. I do get the expected answer for the values of my parameters for basically every guess I put in. However, I noticed that for guesses of the first parameter not close to 0 (particularly, after 1), I get a Covariance Matrix full of infinity. I am not sure why such a simple test is failing.
# python version: 3.9.7
# using a venv
# numpy version: 1.23.2
# scipy version: 1.9.0
import numpy as np
from scipy.optimize import curve_fit
# data taken from a quadratic function of: y = 3*x**2 + 2
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=np.float64)
y = np.array([2, 5, 14, 29, 50, 77, 110, 149, 194, 245, 302], dtype=np.float64)
# quadratic function
def func(x, a, b, c):
return a * x**2 + b * x + c
# test to reproduce success case - notice that we have success when changing the first value upto a value of 1.0
success = [0, 0, 0]
# test to reproduce failure case
failure = [4, 0, 0]
popt, pcov = curve_fit(func, x, y, p0=failure) # change p0 to success or failure
print(popt) # expected answer is [3, 0, 2]
print(pcov) # covariance matrix
I'm not sure why you're expecting a different covariance matrix. The documentation says:
If the Jacobian matrix at the solution doesn’t have a full rank, then ‘lm’ method returns a matrix filled with np.inf
As far as I understand the Jacobian matrix is estimated during the optimization, and depending on what initialization you use the above case might happen. Note that the result of popt still converges!
The covariance matrix is really only useful (and in general, can only be calculated) when each and every variable is optimized. That generally means the variable is moved away from its initial value and in a way so that the dependence of the fit quality (typically, chi-square) from changing the value of this variable can be determined.
It also turns out that if initial guesses are bad, the solution may not be found -- and some variables may not actually be moved from their initial values. I think that is what is happening for you.
An initial value of "0" is particularly troublesome, as the fit really does not know "how zero" that is. Is that "magnitude less than 1e-16" or "magnitude less than 1"? Even using initial values of [4, 0.01, 0.01] would get to a good solution.
An additional potential problem is that your "data" is exactly given by the model function and values. At "the right solution", the residual will be really very very close to zero, and converting the Jacobian matrix of derivatives (of misfit with respect to variables) to covariance can be numerically unstable. That would be very unlikely with real data, but you may want to add a small amount of noise to the data being modeled.
I'm working on an assignment where I need to do KNN Regression using the sklearn library--but, if I have missing data (assume it's missing-at-random) I am not supposed to impute it. Instead, I have to leave it as null and somehow in my code account for it to ignore comparisons where one value is null.
For example, if my observations are (1, 2, 3, 4, null, 6) and (1, null, 3, 4, 5, 6) then I would ignore both the second and the fifth observations.
Is this possible with the sklearn library?
ETA: I would just drop the null values, but I won't know what the data looks like that they'll be testing and it could end up dropping anywhere between 0% and 99% of the data.
This depends a little on what exactly you're trying to do.
Ignore all columns with nulls: I imagine this isn't what you're asking since that's more of a data pre-processing step and isn't really unique to sklearn. Even in pure python, just search for column indices containing nulls and construct a new data set with those indices filtered out.
Ignore null values in vector comparisons: This one is actually kind of fun. Essentially you're saying something like the distance between [1, 2, 3, 4, None, 6] and [1, None, 3, 4, 5, 6] is sqrt(1*1 + 3*3 + 4*4 + 6*6). In this case you need some kind of a custom metric, which sklearn supports. Unfortunately you can't input null values into the KNN fit() method, so even with a custom metric you can't quite get what you want. The solution is to pre-compute distances. E.g.:
from math import sqrt, isfinite
X_train = [
[1, 2, 3, 4, None, 6],
[1, None, 3, 4, 5, 6],
]
y_train = [3.14, 2.72] # we're regressing something
def euclidean(p, q):
# Could also use numpy routines
return sqrt(sum((x-y)**2 for x,y in zip(p,q)))
def is_num(x):
# The `is not None` check needs to happen first because of short-circuiting
return x is not None and isfinite(x)
def restricted_points(p, q):
# Returns copies of `p` and `q` except at coordinates where either vector
# is None, inf, or nan
return tuple(zip(*[(x,y) for x,y in zip(p,q) if all(map(is_num, (x,y)))]))
def dist(p, q):
# Note that in this form you can use any metric you like on the
# restricted vectors, not just the euclidean metric
return euclidean(*restricted_points(p, q))
dists = [[dist(p,q) for p in X_train] for q in X_train]
knn = KNeighborsRegressor(
n_neighbors=1, # only needed in our test example since we have so few data points
metric='precomputed'
)
knn.fit(dists, y_train)
X_test = [
[1, 2, 3, None, None, 6],
]
# We tell sklearn which points in the knn graph to use by telling it how far
# our queries are from every input. This is super inefficient.
predictions = knn.predict([[dist(q, p) for p in X_train] for q in X_test])
There's still an open question of what to do if you have nulls in the outputs you're regressing to, but your problem statement doesn't make it sound like that's an issue for you.
This should work:
import pandas as pd
df = pd.read_csv("your_data.csv")
df.dropna(inplace = True)
I think it's kind of new question, where we didn't have solution. I need to implement some kind of smothering for a very big values in a list of numbers. For ex.
list = np.array([3, 3, 3, 15, 3, 3, 3])
I have made very simple implementation, with smothering such values. What I have tried so far.
def smooth(x, window, threshold):
for idx, val in enumerate(x):
if idx < window:
continue
avr = np.mean(
x[idx-window:idx])
if abs(avr - val) > threshold:
x[idx] = avr + threshold
print(smooth(list1, 3, 1))
# [3, 3, 3, 4, 3, 3, 3]
In this case, everything works Ok, but taking another example, I need to smooth data in a another way(gaussian smooth for ex).
list = np.array([3, 3, 3, 15, 15, 15])
print(smooth(list, 3, 1))
# [3, 3, 3, 4, 4, 3]
Because window moving from the left to right, I don't know norm of next value. Of course I can evaluate window for this numbers from both directions, but just wondering about right ways of doing that, or common technique.
I would advise against implementing 1D filtering yourself, since
you are likely to introduce artifacts into your data when taking a naive approach (as using a rectangular filter shape like you did in your code snippet).
you are unlikely to come up with a implementation remotely as fast as existing implementations, which have been optimized for decades
unless you are doing it for autodidactic reasons, it is a classic example of wasting your time by reinventing the wheel
Instead make use of the rich variety of existing implementations, available e.g. in the scipy package. You can find a nicely illustrated usage example here: Smoothing of a 1D signal (Scipy Cookbook)
I have a set of N data points X = {x1, ..., xn} and a set of N target values / classes Y = {y1, ..., yn}.
The feature vector for a given yi is constructed taking into account a "window" (for lack of a better term) of data points, e.g. I might want to stack "the last 4 data points", i.e. xi-4, xi-3, xi-2, xi-1 for prediction of yi.
Obviously for a window size of 4 such a feature vector cannot be constructed for the first three target values and I would like to simply drop them. Likewise for the last data point xn.
This would not be a problem, except I want this to take place as part of a sklearn pipeline. So far I have successfully written a few custom transformers for other tasks, but those cannot (as far as I know) change the Y matrix.
Is there a way to do this, that I am unaware of or am I stuck doing this as preprocessing outside of the pipeline? (Which means, I would not be able to use GridsearchCV to find the optimal window size and shift.)
I have tried searching for this, but all I came up with was this question, which deals with removing samples from the X matrix. The accepted answer there makes me think, what I want to do is not supported in scikit-learn, but I wanted to make sure.
You are correct, you cannot adjust the your target within a sklearn Pipeline. That doesn't mean that you cannot do a gridsearch, but it does mean that you may have to go about it in a bit more of a manual fashion. I would recommend writing a function do your transformations and filtering on y and then manually loop through a tuning grid created via ParameterGrid. If this doesn't make sense to you edit your post with the code you have for further assistance.
I am struggling with a similar issue and find it unfortunate that you cannot pass on the y-values between transformers. That being said, I bypassed the issue in a bit of a dirty way.
I am storing the y-values as an instance attribute of the transformers. That way I can access them in the transform method when the pipeline calls fit_transform. Then, the transform method passes on a tuple (X, self.y_stored) which is expected by the next estimator. This means I have to write wrapper estimators and it's very ugly, but it works!
Something like this:
class MyWrapperEstimator(RealEstimator):
def fit(X, y=None):
if isinstance(X, tuple):
X, y = X
super().fit(X=X, y=y)
For your specific example of stacking the last 4 data points, you might be able to use seglearn.
>>> import numpy as np
>>> import seglearn
>>> x = np.arange(10)[None,:]
>>> x
array([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
>>> y = x
>>> new_x, new_y, _ = seglearn.transform.SegmentXY(width=4, overlap=0.75).fit_transform(x, y)
>>> new_x
array([[0, 1, 2, 3],
[1, 2, 3, 4],
[2, 3, 4, 5],
[3, 4, 5, 6],
[4, 5, 6, 7],
[5, 6, 7, 8],
[6, 7, 8, 9]])
>>> new_y
array([3, 4, 5, 6, 7, 8, 9])
seglearn claims to be scikit-learn-compatible, so you should be able to fit SegmentXY in the beginning of a scikit-learn pipeline. However, I have not tried it in a pipeline myself.
I am trying to figure out how to calculate covariance with the Python Numpy function cov. When I pass it two one-dimentional arrays, I get back a 2x2 matrix of results. I don't know what to do with that. I'm not great at statistics, but I believe covariance in such a situation should be a single number. This is what I am looking for. I wrote my own:
def cov(a, b):
if len(a) != len(b):
return
a_mean = np.mean(a)
b_mean = np.mean(b)
sum = 0
for i in range(0, len(a)):
sum += ((a[i] - a_mean) * (b[i] - b_mean))
return sum/(len(a)-1)
That works, but I figure the Numpy version is much more efficient, if I could figure out how to use it.
Does anybody know how to make the Numpy cov function perform like the one I wrote?
Thanks,
Dave
When a and b are 1-dimensional sequences, numpy.cov(a,b)[0][1] is equivalent to your cov(a,b).
The 2x2 array returned by np.cov(a,b) has elements equal to
cov(a,a) cov(a,b)
cov(a,b) cov(b,b)
(where, again, cov is the function you defined above.)
Thanks to unutbu for the explanation. By default numpy.cov calculates the sample covariance. To obtain the population covariance you can specify normalisation by the total N samples like this:
numpy.cov(a, b, bias=True)[0][1]
or like this:
numpy.cov(a, b, ddof=0)[0][1]
Note that starting in Python 3.10, one can obtain the covariance directly from the standard library.
Using statistics.covariance which is a measure (the number you're looking for) of the joint variability of two inputs:
from statistics import covariance
# x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
# y = [1, 2, 3, 1, 2, 3, 1, 2, 3]
covariance(x, y)
# 0.75