I think this is a fairly new kind of question that doesn't have a solution yet. I need to implement some kind of smoothing for very large values in a list of numbers. For example:
list = np.array([3, 3, 3, 15, 3, 3, 3])
I have made a very simple implementation that smooths such values. Here is what I have tried so far:
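def smooth(x, window, threshold):
    for idx, val in enumerate(x):
        if idx < window:
            continue
        # Mean of the `window` values preceding the current one
        avr = np.mean(x[idx-window:idx])
        if abs(avr - val) > threshold:
            x[idx] = avr + threshold
    return x  # x is modified in place; returned so it can be printed
print(smooth(list, 3, 1))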
# [3, 3, 3, 4, 3, 3, 3]
In this case everything works fine, but for another example I need to smooth the data in a different way (Gaussian smoothing, for example).
list = np.array([3, 3, 3, 15, 15, 15])
print(smooth(list, 3, 1))
# [3, 3, 3, 4, 4, 3]
Because the window moves from left to right, I don't know the norm of the next value. Of course I could evaluate the window for these numbers from both directions, but I am wondering about the right way of doing this, or a common technique.
I would advise against implementing 1D filtering yourself, since
you are likely to introduce artifacts into your data when taking a naive approach (such as using a rectangular filter shape, as you did in your code snippet);
you are unlikely to come up with an implementation remotely as fast as existing implementations, which have been optimized for decades;
unless you are doing it for autodidactic reasons, it is a classic example of wasting your time by reinventing the wheel.
Instead make use of the rich variety of existing implementations, available e.g. in the scipy package. You can find a nicely illustrated usage example here: Smoothing of a 1D signal (Scipy Cookbook)
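As a minimal sketch of what this could look like for the example array from the question, assuming SciPy is installed: scipy.ndimage.median_filter removes isolated spikes, and scipy.ndimage.gaussian_filter1d performs the Gaussian smoothing mentioned in the question. The size and sigma values are illustrative choices, not recommendations.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d, median_filter

x = np.array([3, 3, 3, 15, 3, 3, 3], dtype=float)

# Median filter: each value becomes the median of its 3-point neighbourhood,
# so an isolated spike disappears.
despiked = median_filter(x, size=3)

# Gaussian filter: the spike is spread over its neighbours instead of removed.
smoothed = gaussian_filter1d(x, sigma=1.0)

print(despiked)
print(smoothed)
```

Note that a run of several large values, such as [3, 3, 3, 15, 15, 15], is not a spike to a 3-point median filter, so a wider window would be needed there.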
Gist
Basically, I want to increase the dimension of two axes of an n-dimensional tensor.
For some reason this operation seems very slow on bigger tensors.
If someone can give me a reason or better method I'd be very happy.
Goal
Going from (4, 8, 8, 4, 4, 4, 4, 4, 16, 8, 4, 4, 1) to (4, 32, 8, 4, 4, 4, 4, 4, 4, 8, 4, 4, 1) takes roughly 170 seconds. I'd like to improve on that. Below is an example; finding the correct indices is not necessary here.
Example Code
Increase dimension (0,2) of tensor
import numpy as np
tensor = np.arange(16).reshape(2,2,4,1)
I = np.identity(4)
I tried 3 different methods:
np.kron
indices = [1,3,0,2]
result = np.kron(
I, tensor.transpose(indices)
).transpose(np.argsort(indices))
print(result.shape) # should be (8,2,16,1)
manual stacking
col = []
for i in range(4):
    # One block row: zeros everywhere except the tensor at position i
    row = [np.zeros_like(tensor)]*4
    row[i] = tensor
    col.append(row)
result = np.array(col).transpose(0,2,3,1,4,5).reshape(8,2,16,1)
print(result.shape) # should be (8,2,16,1)
np.einsum
result = np.einsum("ij, abcd -> iabjcd", I, tensor).reshape(8,2,16,1)
print(result.shape) # should be (8,2,16,1)
Results
On my machine they performed the following (on the big example with complex entries):
np.einsum ~ 170s
manual stacking ~ 185s
np.kron ~ 580s
As Jérôme pointed out:
all your operations seems to involve a transposition which is known to be very expensive on modern hardware.
I reworked my algorithm to not rely on the dimensional increase by doing certain preprocessing steps. This indeed speeds up the overall process substantially.
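For readers who still need the enlarged tensor itself, here is a hedged sketch of one alternative on the small example: preallocate the six-axis result and write the tensor into the i == j blocks, so that no multiplication by zeros and no final transposition is involved. This is not the reworked algorithm mentioned above, and it has not been timed on the large case.

```python
import numpy as np

tensor = np.arange(16).reshape(2, 2, 4, 1)
n = 4  # size of the identity factor

# Reference: the einsum version from the question.
expected = np.einsum("ij,abcd->iabjcd", np.identity(n), tensor).reshape(8, 2, 16, 1)

# Preallocate the 6-axis result and fill only the blocks where i == j.
blocks = np.zeros((n, 2, 2, n, 4, 1), dtype=tensor.dtype)
for i in range(n):
    blocks[i, :, :, i] = tensor

# The array is contiguous, so this reshape does not copy any data.
result = blocks.reshape(8, 2, 16, 1)

print(np.array_equal(result, expected))
```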
I'm working with univariate splines from scipy. A simple example of one is as follows:
from scipy.interpolate import UnivariateSpline
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
f = UnivariateSpline(x, y)
Is there any way I could make the resulting spline strictly increasing or strictly decreasing? I've noticed that, even if I feed it strictly increasing or decreasing data points, the result won't necessarily have this property.
Look for monotone interpolants, such as pchip (scipy.interpolate.PchipInterpolator) and/or akima (scipy.interpolate.Akima1DInterpolator). These are at least locally monotone.
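A minimal sketch with the data from the question, assuming SciPy is installed. It uses scipy.interpolate.PchipInterpolator, which passes through every data point and preserves the monotonicity of the data; the evaluation grid xs is just an illustrative choice.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
f = PchipInterpolator(x, y)

# Evaluate on a dense grid and check that the curve never decreases.
xs = np.linspace(1, 5, 201)
ys = f(xs)
print(np.all(np.diff(ys) >= 0))
```

Unlike UnivariateSpline with its default smoothing, this is a pure interpolant, so it is less suitable for noisy data.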
According to the NumPy documentation they may deprecate their np.matrix class. And while arrays do have their multitude of use cases, they cannot do everything. Specifically, they will "break" when doing pretty basic linear algebra operations (you can read more about it here).
Building my own matrix multiplication module in Python is not too difficult, but it would not be optimized at all. I am looking for another library that has full linear algebra support and is built on BLAS (Basic Linear Algebra Subprograms). Or, at the least, are there any documents on how to integrate BLAS into Python yourself?
Edit: So some are suggesting the @ operator, which is like pushing a mole down a hole and having him pop up immediately in the neighbouring one. In essence, what is happening is a debugger's nightmare:
W*x == W*x.T
W@x == W@x.T
You would hope that an error is raised here letting you know that you made a mistake in defining your matrices. But since arrays don't store 2D information if they are along one axis, I am not sure that the issue can ever be solved via np.array. (These problems don't exist with np.matrix but for some reason the developers seem insistent on removing it).
If you insist on the distinction between column and row vectors, you can do that.
>>> x = np.array([1, 2, 3]).reshape(-1, 1)
>>> W = np.arange(15).reshape(5, 3)
>>> x
array([[1],
       [2],
       [3]])
>>> W
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])
>>> W @ x
array([[ 8],
       [26],
       [44],
       [62],
       [80]])
>>> W @ x.T
ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0,
with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 1 is different from 3)
You could create helper functions to create column and row vectors:
def rowvec(x):
    return np.array(x).reshape(1, -1)
def colvec(x):
    return np.array(x).reshape(-1, 1)
>>> rowvec([1, 2, 3])
array([[1, 2, 3]])
>>> colvec([1, 2, 3])
array([[1],
       [2],
       [3]])
I would recommend that you only use this kind of construct when you're porting existing Matlab code. Otherwise you'll have trouble reading numpy code written by others, and many library functions expect 1D arrays as inputs, not (1, n)-shaped arrays.
Actually, numpy offers BLAS-powered matrix multiplication through the matmul operator @. This invokes the __matmul__ magic method for a given class.
All you have to do in the above example is W @ x.
Other linear algebra stuff can be found in the np.linalg module.
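A short sketch of both points, using the W and x from the example above as plain arrays; the matrix A and vector b are made-up values for illustration only.

```python
import numpy as np

W = np.arange(15).reshape(5, 3)
x = np.array([1, 2, 3])  # 1D array, shape (3,)

# Matrix-vector product; the result is 1D with values 8, 26, 44, 62, 80.
y = W @ x
print(y)

# Transposing a 1D array is a no-op, which is why W @ x.T raises no error.
print(x.T.shape)

# np.linalg covers the usual linear algebra routines, e.g. solving A z = b.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([3.0, 5.0])
sol = np.linalg.solve(A, b)
print(np.allclose(A @ sol, b))
```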
Edit: I guess your problem is way more about the language's style than any technical issues. I found this answer very elucidative:
Transposing a NumPy array
Also, I find it very improbable that you will find something that is NOT numpy since most of the major machine learning/data science frameworks rely on it.
I'm working on an assignment where I need to do KNN Regression using the sklearn library--but, if I have missing data (assume it's missing-at-random) I am not supposed to impute it. Instead, I have to leave it as null and somehow in my code account for it to ignore comparisons where one value is null.
For example, if my observations are (1, 2, 3, 4, null, 6) and (1, null, 3, 4, 5, 6) then I would ignore both the second and the fifth observations.
Is this possible with the sklearn library?
ETA: I would just drop the null values, but I won't know what the data looks like that they'll be testing and it could end up dropping anywhere between 0% and 99% of the data.
This depends a little on what exactly you're trying to do.
Ignore all columns with nulls: I imagine this isn't what you're asking since that's more of a data pre-processing step and isn't really unique to sklearn. Even in pure python, just search for column indices containing nulls and construct a new data set with those indices filtered out.
Ignore null values in vector comparisons: This one is actually kind of fun. Essentially you're saying that the distance between [1, 2, 3, 4, None, 6] and [1, None, 3, 4, 5, 6] should be computed only over the coordinates present in both vectors, i.e. sqrt((1-1)**2 + (3-3)**2 + (4-4)**2 + (6-6)**2). In this case you need some kind of a custom metric, which sklearn supports. Unfortunately you can't input null values into the KNN fit() method, so even with a custom metric you can't quite get what you want. The solution is to pre-compute distances. E.g.:
from math import sqrt, isfinite
from sklearn.neighbors import KNeighborsRegressor

X_train = [
    [1, 2, 3, 4, None, 6],
    [1, None, 3, 4, 5, 6],
]
y_train = [3.14, 2.72]  # we're regressing something

def euclidean(p, q):
    # Could also use numpy routines
    return sqrt(sum((x-y)**2 for x,y in zip(p,q)))

def is_num(x):
    # The `is not None` check needs to happen first because of short-circuiting
    return x is not None and isfinite(x)

def restricted_points(p, q):
    # Returns copies of `p` and `q` except at coordinates where either vector
    # is None, inf, or nan
    return tuple(zip(*[(x,y) for x,y in zip(p,q) if all(map(is_num, (x,y)))]))

def dist(p, q):
    # Note that in this form you can use any metric you like on the
    # restricted vectors, not just the euclidean metric
    return euclidean(*restricted_points(p, q))

dists = [[dist(p,q) for p in X_train] for q in X_train]
knn = KNeighborsRegressor(
    n_neighbors=1,  # only needed in our test example since we have so few data points
    metric='precomputed'
)
knn.fit(dists, y_train)

X_test = [
    [1, 2, 3, None, None, 6],
]
# We tell sklearn which points in the knn graph to use by telling it how far
# our queries are from every input. This is super inefficient.
predictions = knn.predict([[dist(q, p) for p in X_train] for q in X_test])
There's still an open question of what to do if you have nulls in the outputs you're regressing to, but your problem statement doesn't make it sound like that's an issue for you.
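The comment "Could also use numpy routines" can be sketched as follows. This is a hedged, vectorized variant that assumes missing values are encoded as np.nan rather than None; the function name masked_euclidean is made up for this example. Unlike the pure-Python version, it only skips nan and does not handle inf specially.

```python
import numpy as np

def masked_euclidean(A, B):
    """Pairwise Euclidean distances, skipping coordinates where either is nan."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    # Shape (len(A), len(B), n_features); nan wherever either input is nan.
    diff = A[:, None, :] - B[None, :, :]
    # nansum treats nan as 0, so missing coordinates do not contribute.
    return np.sqrt(np.nansum(diff ** 2, axis=-1))

nan = np.nan
X_train = [[1, 2, 3, 4, nan, 6], [1, nan, 3, 4, 5, 6]]
X_test = [[1, 4, 3, nan, nan, 6]]

D_train = masked_euclidean(X_train, X_train)  # shape (2, 2)
D_test = masked_euclidean(X_test, X_train)    # shape (1, 2)
print(D_train)
print(D_test)
```

D_train is what a metric='precomputed' estimator would receive in fit(), and D_test in predict().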
This should work:
import pandas as pd
df = pd.read_csv("your_data.csv")
df.dropna(inplace = True)
I have a set of N data points X = {x1, ..., xn} and a set of N target values / classes Y = {y1, ..., yn}.
The feature vector for a given yi is constructed taking into account a "window" (for lack of a better term) of data points, e.g. I might want to stack "the last 4 data points", i.e. xi-4, xi-3, xi-2, xi-1 for prediction of yi.
Obviously for a window size of 4 such a feature vector cannot be constructed for the first three target values and I would like to simply drop them. Likewise for the last data point xn.
This would not be a problem, except I want this to take place as part of a sklearn pipeline. So far I have successfully written a few custom transformers for other tasks, but those cannot (as far as I know) change the Y matrix.
Is there a way to do this that I am unaware of, or am I stuck doing this as preprocessing outside of the pipeline? (Which means I would not be able to use GridSearchCV to find the optimal window size and shift.)
I have tried searching for this, but all I came up with was this question, which deals with removing samples from the X matrix. The accepted answer there makes me think, what I want to do is not supported in scikit-learn, but I wanted to make sure.
You are correct, you cannot adjust your target within a sklearn Pipeline. That doesn't mean that you cannot do a grid search, but it does mean that you may have to go about it in a bit more of a manual fashion. I would recommend writing a function to do your transformations and filtering on y and then manually looping through a tuning grid created via ParameterGrid. If this doesn't make sense to you, edit your post with the code you have for further assistance.
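A rough sketch of that manual loop, assuming scikit-learn and NumPy 1.20 or later. The helper make_windows, the toy sine data, the LinearRegression model and the grid values are all placeholders chosen for illustration, not part of the question.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ParameterGrid, cross_val_score

def make_windows(x, y, window):
    """Stack the `window` points preceding each target: x[i-window:i] -> y[i]."""
    # The last window has no following target, so it is dropped.
    X = sliding_window_view(x, window)[:-1]
    return X, y[window:]

# Toy data: predict the next value of a sine wave from the previous ones.
x = np.sin(np.linspace(0, 20, 200))
y = x.copy()

best = None
for params in ParameterGrid({"window": [2, 4, 8]}):
    # X and y are filtered together here, outside of any Pipeline.
    X_w, y_w = make_windows(x, y, params["window"])
    score = cross_val_score(LinearRegression(), X_w, y_w, cv=3).mean()
    if best is None or score > best[0]:
        best = (score, params)

print(best)
```

Everything that does not touch y can still live in a Pipeline that is passed to cross_val_score in place of the bare estimator.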
I am struggling with a similar issue and find it unfortunate that you cannot pass on the y-values between transformers. That being said, I bypassed the issue in a bit of a dirty way.
I am storing the y-values as an instance attribute of the transformers. That way I can access them in the transform method when the pipeline calls fit_transform. Then, the transform method passes on a tuple (X, self.y_stored) which is expected by the next estimator. This means I have to write wrapper estimators and it's very ugly, but it works!
Something like this:
class MyWrapperEstimator(RealEstimator):
    def fit(self, X, y=None):
        # The previous transformer passes (X, y) as a tuple; unpack it here
        if isinstance(X, tuple):
            X, y = X
        return super().fit(X=X, y=y)
For your specific example of stacking the last 4 data points, you might be able to use seglearn.
>>> import numpy as np
>>> import seglearn
>>> x = np.arange(10)[None,:]
>>> x
array([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
>>> y = x
>>> new_x, new_y, _ = seglearn.transform.SegmentXY(width=4, overlap=0.75).fit_transform(x, y)
>>> new_x
array([[0, 1, 2, 3],
       [1, 2, 3, 4],
       [2, 3, 4, 5],
       [3, 4, 5, 6],
       [4, 5, 6, 7],
       [5, 6, 7, 8],
       [6, 7, 8, 9]])
>>> new_y
array([3, 4, 5, 6, 7, 8, 9])
seglearn claims to be scikit-learn-compatible, so you should be able to fit SegmentXY in the beginning of a scikit-learn pipeline. However, I have not tried it in a pipeline myself.