scikit-learn custom transformer / pipeline that changes X and Y - python

I have a set of N data points X = {x1, ..., xn} and a set of N target values / classes Y = {y1, ..., yn}.
The feature vector for a given yi is constructed taking into account a "window" (for lack of a better term) of data points, e.g. I might want to stack "the last 4 data points", i.e. xi-4, xi-3, xi-2, xi-1 for prediction of yi.
Obviously for a window size of 4 such a feature vector cannot be constructed for the first three target values and I would like to simply drop them. Likewise for the last data point xn.
This would not be a problem, except I want this to take place as part of a sklearn pipeline. So far I have successfully written a few custom transformers for other tasks, but those cannot (as far as I know) change the Y matrix.
Is there a way to do this that I am unaware of, or am I stuck doing this as preprocessing outside of the pipeline? (That would mean I could not use GridSearchCV to find the optimal window size and shift.)
I have tried searching for this, but all I came up with was this question, which deals with removing samples from the X matrix. The accepted answer there makes me think that what I want to do is not supported in scikit-learn, but I wanted to make sure.

You are correct, you cannot adjust your target within a sklearn Pipeline. That doesn't mean that you cannot do a grid search, but it does mean that you may have to go about it in a more manual fashion. I would recommend writing a function to do your transformations and filtering on y, and then manually looping through a tuning grid created via ParameterGrid. If this doesn't make sense to you, edit your post with the code you have for further assistance.
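A minimal sketch of that manual loop might look like the following. The helper make_windows, the Ridge estimator, the synthetic data and the grid values are all assumptions made for illustration; only ParameterGrid and cross_val_score come from scikit-learn itself.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import ParameterGrid, cross_val_score


def make_windows(X, y, window):
    """Hypothetical helper: stack the `window` rows preceding each target
    and drop the targets that have no full window before them."""
    n = len(X)
    Xw = np.stack([X[i - window:i].ravel() for i in range(window, n)])
    return Xw, y[window:]


# Synthetic data, only to make the sketch runnable
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = rng.normal(size=200)

results = []
for params in ParameterGrid({"window": [2, 4, 8], "alpha": [0.1, 1.0]}):
    # The window size changes both X and y, so it is applied outside the estimator
    Xw, yw = make_windows(X, y, params["window"])
    score = cross_val_score(Ridge(alpha=params["alpha"]), Xw, yw, cv=3).mean()
    results.append((score, params))

best_score, best_params = max(results, key=lambda r: r[0])
```

For ordered data, a splitter such as TimeSeriesSplit may be a better choice for cv than the default folds.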

I am struggling with a similar issue and find it unfortunate that you cannot pass on the y-values between transformers. That being said, I bypassed the issue in a bit of a dirty way.
I am storing the y-values as an instance attribute of the transformers. That way I can access them in the transform method when the pipeline calls fit_transform. Then, the transform method passes on a tuple (X, self.y_stored) which is expected by the next estimator. This means I have to write wrapper estimators and it's very ugly, but it works!
Something like this:
class MyWrapperEstimator(RealEstimator):
    def fit(self, X, y=None):
        # The preceding transformer may hand over (X, y) packed in a tuple
        if isinstance(X, tuple):
            X, y = X
        super().fit(X=X, y=y)
        return self

For your specific example of stacking the last 4 data points, you might be able to use seglearn.
>>> import numpy as np
>>> import seglearn
>>> x = np.arange(10)[None,:]
>>> x
array([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
>>> y = x
>>> new_x, new_y, _ = seglearn.transform.SegmentXY(width=4, overlap=0.75).fit_transform(x, y)
>>> new_x
array([[0, 1, 2, 3],
       [1, 2, 3, 4],
       [2, 3, 4, 5],
       [3, 4, 5, 6],
       [4, 5, 6, 7],
       [5, 6, 7, 8],
       [6, 7, 8, 9]])
>>> new_y
array([3, 4, 5, 6, 7, 8, 9])
seglearn claims to be scikit-learn-compatible, so you should be able to place SegmentXY at the beginning of a scikit-learn pipeline. However, I have not tried it in a pipeline myself.

Related

Assert Scipy Univariate Spline Strictly Increasing

I'm working with univariate splines from scipy. A simple example of one is as follows:
from scipy.interpolate import UnivariateSpline
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
f = UnivariateSpline(x, y)
Is there any way I could make the resulting spline strictly increasing or strictly decreasing? I've noticed that, even if I feed it strictly increasing or decreasing data points, the result won't necessarily have this property.
Look for monotone interpolants: PCHIP (scipy.interpolate.PchipInterpolator) and/or Akima (scipy.interpolate.Akima1DInterpolator). These are at least locally monotone.
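As a rough illustration of the PCHIP suggestion, the sketch below interpolates the data from the question and checks on a dense grid that the result never decreases. The grid resolution and the variable names are arbitrary choices for this example.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([1, 4, 9, 16, 25], dtype=float)

# PCHIP passes through the data points and preserves their monotonicity
f = PchipInterpolator(x, y)

# Numerical check on a dense grid: no value is smaller than its predecessor
grid = np.linspace(x[0], x[-1], 401)
values = f(grid)
is_monotone = bool(np.all(np.diff(values) >= 0))
```

Note that this is an interpolant, so unlike a smoothing UnivariateSpline it passes through every data point.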

Create interaction term in scikit-learn

There are certainly many ways of creating interaction terms in Python, whether by using numpy or pandas directly, or some library like patsy. However, I was looking for a way of creating interaction terms scikit-learn style, i.e. in a form that plays nicely with its fit-transform-predict paradigm. How might I do this?
Let's consider the case of making an interaction term between two variables.
You might make use of the FunctionTransformer class, like so:
import numpy as np
from sklearn.preprocessing import FunctionTransformer
# 5 rows, 2 columns
X = np.arange(10).reshape(5, 2)
# Appends interaction of columns at 0 and 1 indices to original matrix
interaction_append_function = lambda x: np.append(x, (x[:, 0] * x[:, 1])[:, None], 1)
interaction_transformer = FunctionTransformer(func=interaction_append_function)
Let's try it out:
>>> interaction_transformer.fit_transform(X)
array([[ 0,  1,  0],
       [ 2,  3,  6],
       [ 4,  5, 20],
       [ 6,  7, 42],
       [ 8,  9, 72]])
You now have a transformer that will play well with other workflows like sklearn.pipeline or sklearn.compose.
Certainly there are more extensible ways of handling this, but hopefully you get the idea.
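One possibly more extensible option, not mentioned above, is scikit-learn's PolynomialFeatures with interaction_only=True. The sketch below applies it to the same X and should produce the two original columns followed by their product.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# 5 rows, 2 columns, same toy data as above
X = np.arange(10).reshape(5, 2)

# degree=2 with interaction_only=True keeps x0, x1 and adds x0*x1 (no squares);
# include_bias=False drops the constant column of ones
interaction = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_inter = interaction.fit_transform(X)
```

With more than two input columns this generates every pairwise product, which may or may not be what you want.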

How to handle missing data in KNN without imputing?

I'm working on an assignment where I need to do KNN Regression using the sklearn library--but, if I have missing data (assume it's missing-at-random) I am not supposed to impute it. Instead, I have to leave it as null and somehow in my code account for it to ignore comparisons where one value is null.
For example, if my observations are (1, 2, 3, 4, null, 6) and (1, null, 3, 4, 5, 6) then I would ignore both the second and the fifth observations.
Is this possible with the sklearn library?
ETA: I would just drop the null values, but I won't know what the data looks like that they'll be testing and it could end up dropping anywhere between 0% and 99% of the data.
This depends a little on what exactly you're trying to do.
Ignore all columns with nulls: I imagine this isn't what you're asking since that's more of a data pre-processing step and isn't really unique to sklearn. Even in pure python, just search for column indices containing nulls and construct a new data set with those indices filtered out.
Ignore null values in vector comparisons: This one is actually kind of fun. Essentially you're saying that the distance between [1, 2, 3, 4, None, 6] and [1, None, 3, 4, 5, 6] should be computed only over the coordinates where both vectors have a value, i.e. between [1, 3, 4, 6] and [1, 3, 4, 6], which gives sqrt((1-1)**2 + (3-3)**2 + (4-4)**2 + (6-6)**2) = 0. In this case you need some kind of a custom metric, which sklearn supports. Unfortunately you can't input null values into the KNN fit() method, so even with a custom metric you can't quite get what you want. The solution is to pre-compute distances. E.g.:
from math import sqrt, isfinite
from sklearn.neighbors import KNeighborsRegressor
X_train = [
    [1, 2, 3, 4, None, 6],
    [1, None, 3, 4, 5, 6],
]
y_train = [3.14, 2.72]  # we're regressing something
def euclidean(p, q):
    # Could also use numpy routines
    return sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))
def is_num(x):
    # The `is not None` check needs to happen first because of short-circuiting
    return x is not None and isfinite(x)
def restricted_points(p, q):
    # Returns copies of `p` and `q` except at coordinates where either vector
    # is None, inf, or nan (assumes at least one coordinate survives)
    return tuple(zip(*[(x, y) for x, y in zip(p, q) if all(map(is_num, (x, y)))]))
def dist(p, q):
    # Note that in this form you can use any metric you like on the
    # restricted vectors, not just the euclidean metric
    return euclidean(*restricted_points(p, q))
dists = [[dist(p, q) for p in X_train] for q in X_train]
knn = KNeighborsRegressor(
    n_neighbors=1,  # only needed in our test example since we have so few data points
    metric='precomputed'
)
knn.fit(dists, y_train)
X_test = [
    [1, 2, 3, None, None, 6],
]
# We tell sklearn which points in the knn graph to use by telling it how far
# our queries are from every input. This is super inefficient.
predictions = knn.predict([[dist(q, p) for p in X_train] for q in X_test])
There's still an open question of what to do if you have nulls in the outputs you're regressing to, but your problem statement doesn't make it sound like that's an issue for you.
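If a slightly different definition of the distance is acceptable, the pre-computation can also be sketched with scikit-learn's nan_euclidean_distances. Be aware that it is not identical to the metric above: it rescales the squared distance by (total coordinates / coordinates present in both vectors). The test point below is made up for illustration and uses np.nan instead of None.

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances
from sklearn.neighbors import KNeighborsRegressor

X_train = np.array([
    [1, 2, 3, 4, np.nan, 6],
    [1, np.nan, 3, 4, 5, 6],
])
y_train = np.array([3.14, 2.72])
X_test = np.array([[1, 2, 3, np.nan, np.nan, 7]])  # hypothetical query

# Pairwise distances that skip coordinates missing in either vector
train_dists = nan_euclidean_distances(X_train)
test_dists = nan_euclidean_distances(X_test, X_train)

knn = KNeighborsRegressor(n_neighbors=1, metric="precomputed")
knn.fit(train_dists, y_train)
predictions = knn.predict(test_dists)
```

Because of the rescaling, vectors that share fewer observed coordinates are penalised relative to the plain restricted Euclidean distance, which can change which neighbour is nearest.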
This should work:
import pandas as pd
df = pd.read_csv("your_data.csv")
df.dropna(inplace = True)

Smooth aggressive values in the list

I think this is a fairly new question for which I could not find an existing solution. I need to implement some kind of smoothing for very large values in a list of numbers. For example:
list1 = np.array([3, 3, 3, 15, 3, 3, 3])
I have made a very simple implementation that smooths such values. Here is what I have tried so far:
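def smooth(x, window, threshold):
    # Clamp values that exceed the mean of the previous `window` values by more than `threshold`
    for idx, val in enumerate(x):
        if idx < window:
            continue
        avr = np.mean(x[idx-window:idx])
        if abs(avr - val) > threshold:
            x[idx] = avr + threshold
    return x  # note: `x` is also modified in place
print(smooth(list1, 3, 1))
# [3 3 3 4 3 3 3]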
In this case everything works fine, but for another example I would need to smooth the data in a different way (Gaussian smoothing, for example).
list2 = np.array([3, 3, 3, 15, 15, 15])
print(smooth(list2, 3, 1))
# [3 3 3 4 4 4]
Because the window moves from left to right, I don't know the level of the following values when I process a point. Of course I could evaluate the window around each value from both directions, but I am wondering what the right way, or the common technique, for doing this is.
I would advise against implementing 1D filtering yourself, since:
- you are likely to introduce artifacts into your data when taking a naive approach (such as the rectangular filter shape in your code snippet);
- you are unlikely to come up with an implementation remotely as fast as existing implementations, which have been optimized for decades;
- unless you are doing it to teach yourself, it is a classic example of wasting your time by reinventing the wheel.
Instead, make use of the rich variety of existing implementations, available e.g. in the scipy package. You can find a nicely illustrated usage example here: Smoothing of a 1D signal (Scipy Cookbook)
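As a small sketch of what using existing implementations could look like, the example below applies two SciPy filters to the signal from the question: a median filter, which removes isolated spikes, and a Gaussian filter, which spreads them out. The kernel size and sigma are arbitrary values chosen for illustration, not recommendations.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d
from scipy.signal import medfilt

signal = np.array([3, 3, 3, 15, 3, 3, 3], dtype=float)

# Median filter: replaces each value by the median of its 3-sample neighbourhood,
# so a single outlier disappears entirely
despiked = medfilt(signal, kernel_size=3)

# Gaussian filter: weighted average that looks in both directions around each sample
smoothed = gaussian_filter1d(signal, sigma=1.0)
```

Both filters are symmetric around each sample, which addresses the concern about a window that only looks backwards.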

Apply custom function to 2 or more rows (or columns) in numpy

I am fairly new to numpy. I want to apply a custom function to 1, 2 or more rows (or columns). How can I do this? Before this is marked as a duplicate, I want to point out that the only thread I found that does this is "how to apply a generic function over numpy rows?". There are two issues with that post:
a) As a beginner, I am not quite sure what operation like A[:,None,:] does.
b) That operation doesn't work in my case. Please see below.
Let's assume that Matrix M is:
import numpy as np
M = np.array([[8, 3, 2],
[6, 1, 2],
[1, 2, 4]])
Now, I would want to calculate product of combination of all three rows. For this, I have created a custom function. Actual operation of the function could be different from multiplication. Multiplication is just an example.
def myf(a,b): return(a*b)
I have taken numpy array product as an example. Actual custom function could be different, but no matter what the operation is, the function will always return a numpy array. i.e. it will take two equally-sized numpy 1-D array and return 1-D array. In myf I am assuming that a and b are each np.array.
I want to be able to apply custom function to any two rows or columns, or even three rows (recursively applying function).
Expected output after multiplying two rows recursively:
If I apply pairwise row-operation:
[[48,3,4],
 [6,2,8],
 [8,6,8]]
OR ( The order of application of custom function doesn't matter. Hence, the actual position of rows in the output matrix won't matter. Below matrix will be fine as well.)
[[6,2,8],
 [48,3,4], #row1 and 2 are swapped
 [8,6,8]]
Similarly, if I apply pairwise operation on columns, I would get
[[24, 6, 16],
 [6, 2, 12],
 [2, 8, 4]]
Similarly, if I apply custom function to all three rows, I would get:
[48,6,16] #row-wise
OR
[48,12,8] #column-wise
I tried a few approaches after reading SO:
1:
vf=np.vectorize(myf)
vf(M,M)
However, above function applies custom function element-wise rather than row-wise or columnwise.
2:
I also tried:
M[:,None,:].dot(M) #dot mimics multiplication. Python wouldn't accept `*`
There are two problems with this:
a) I don't know what the output is.
b) I cannot apply custom function.
Can someone please help me? I'd appreciate any help.
I am open to numpy and scipy.
Some experts have requested desired output. Let's assume that the desired output is
[[48,3,4],
 [6,2,8],
 [8,6,8]].
However, I'd appreciate some guidance on customizing the solution for 2 or more columns and 2 or more rows.
You can simply roll your array along the 0th axis:
np.roll(M, -1, axis=0)
# array([[6, 1, 2],
#        [1, 2, 4],
#        [8, 3, 2]])
Then multiply the result with your original array:
M * np.roll(M, -1, axis=0)
# array([[48,  3,  4],
#        [ 6,  2,  8],
#        [ 8,  6,  8]])
If you want to incorporate more than two rows, you can roll it more than once:
M * np.roll(M, -1, axis=0) * np.roll(M, -2, axis=0)
# array([[48,  6, 16],
#        [48,  6, 16],
#        [48,  6, 16]])
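To generalise the rolling idea to an arbitrary function of two 1-D arrays, and to columns as well as rows, something along these lines could work. The helpers pairwise and reduce_all are hypothetical names introduced for this sketch; myf is the function from the question.

```python
from functools import reduce

import numpy as np

M = np.array([[8, 3, 2],
              [6, 1, 2],
              [1, 2, 4]])


def myf(a, b):
    return a * b


def pairwise(func, arr, axis=0):
    """Apply func to each row (axis=0) or column (axis=1) and its successor,
    wrapping around at the end."""
    lines = np.moveaxis(arr, axis, 0)        # iterate over rows or columns
    shifted = np.roll(lines, -1, axis=0)     # successor of each line
    out = np.stack([func(a, b) for a, b in zip(lines, shifted)])
    return np.moveaxis(out, 0, axis)         # restore the original orientation


def reduce_all(func, arr, axis=0):
    """Fold func over all rows (axis=0) or all columns (axis=1)."""
    return reduce(func, np.moveaxis(arr, axis, 0))


rows_pairwise = pairwise(myf, M, axis=0)
cols_pairwise = pairwise(myf, M, axis=1)
rows_all = reduce_all(myf, M, axis=0)
cols_all = reduce_all(myf, M, axis=1)
```

The explicit Python loop is slower than the vectorised M * np.roll(...) form, but it only requires that the function accept two equally sized 1-D arrays and return a 1-D array.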
