How to handle missing data in KNN without imputing? - python

I'm working on an assignment where I need to do KNN regression using the sklearn library, but if I have missing data (assume it's missing-at-random) I am not supposed to impute it. Instead, I have to leave it as null and account for it in my code so that comparisons where one value is null are ignored.
For example, if my observations are (1, 2, 3, 4, null, 6) and (1, null, 3, 4, 5, 6), then I would ignore both the second and the fifth values when comparing them.
Is this possible with the sklearn library?
ETA: I would just drop the null values, but I won't know what the test data looks like, and dropping nulls could end up discarding anywhere between 0% and 99% of the data.

This depends a little on what exactly you're trying to do.
Ignore all columns with nulls: I imagine this isn't what you're asking since that's more of a data pre-processing step and isn't really unique to sklearn. Even in pure python, just search for column indices containing nulls and construct a new data set with those indices filtered out.
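For what it's worth, a minimal pure-Python sketch of that column filtering (my own illustration, not part of the original answer) could look like this:

X = [[1, 2, 3, 4, None, 6],
     [1, None, 3, 4, 5, 6]]
# Collect every column index that contains a null anywhere, then drop them.
bad_cols = {i for row in X for i, v in enumerate(row) if v is None}
X_filtered = [[v for i, v in enumerate(row) if i not in bad_cols] for row in X]
# X_filtered == [[1, 3, 4, 6], [1, 3, 4, 6]]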
Ignore null values in vector comparisons: This one is actually kind of fun. Essentially you're saying that the distance between [1, 2, 3, 4, None, 6] and [1, None, 3, 4, 5, 6] should be computed only over the coordinates where both vectors are non-null, i.e. sqrt((1-1)**2 + (3-3)**2 + (4-4)**2 + (6-6)**2). In this case you need some kind of custom metric, which sklearn supports. Unfortunately you can't pass null values into the KNN fit() method, so even with a custom metric you can't quite get what you want. The solution is to pre-compute the distances. E.g.:
from math import sqrt, isfinite
from sklearn.neighbors import KNeighborsRegressor

X_train = [
    [1, 2, 3, 4, None, 6],
    [1, None, 3, 4, 5, 6],
]
y_train = [3.14, 2.72]  # we're regressing something

def euclidean(p, q):
    # Could also use numpy routines
    return sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def is_num(x):
    # The `is not None` check needs to happen first because of short-circuiting
    return x is not None and isfinite(x)

def restricted_points(p, q):
    # Returns copies of `p` and `q` except at coordinates where either vector
    # is None, inf, or nan
    return tuple(zip(*[(x, y) for x, y in zip(p, q) if all(map(is_num, (x, y)))]))

def dist(p, q):
    # Note that in this form you can use any metric you like on the
    # restricted vectors, not just the euclidean metric
    return euclidean(*restricted_points(p, q))

dists = [[dist(p, q) for p in X_train] for q in X_train]

knn = KNeighborsRegressor(
    n_neighbors=1,  # only needed in our test example since we have so few data points
    metric='precomputed'
)
knn.fit(dists, y_train)

X_test = [
    [1, 2, 3, None, None, 6],
]

# We tell sklearn which points in the knn graph to use by telling it how far
# our queries are from every input. This is super inefficient.
predictions = knn.predict([[dist(q, p) for p in X_train] for q in X_test])
There's still an open question of what to do if you have nulls in the outputs you're regressing to, but your problem statement doesn't make it sound like that's an issue for you.

This should work:
import pandas as pd
df = pd.read_csv("your_data.csv")
df.dropna(inplace = True)

Related

Why is the percentile() method not calculating the appropriate percentile? The 25th percentile for this data should be 1.5, or 2 if rounded off.

import numpy as np
value = [1, 2, 3, 4, 5, 6]
x = np.percentile(value, 25)
print(x)
I am calculating the percentile using this code to cross-verify:
import sys
import math
import numpy as np

def my_percentile(data, percentile):
    n = len(data)
    p = n * percentile / 100
    if p.is_integer():
        return sorted(data)[int(p)]
    else:
        return sorted(data)[int(math.ceil(p)) - 1]

t = [1, 2, 3, 4, 5, 6]
per = my_percentile(t, 25)
print(per)
There's more than one way to calculate quartiles. Wikipedia has a good summary under quantiles.
The values returned by numpy's default calculation match those returned by, for example, R's summary() function.
You need to do one of these things.
Switch to numpy.percentile's default way of calculating quartiles,
provide a value to numpy.percentile's parameter interpolation, or
write your own custom function.
Valid values for interpolation are listed in the numpy.percentile documentation.
I didn't suggest a value for interpolation, because you didn't include your expected output in your question. You need to consider the effect of your decision on all quartiles, not just on one.
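For illustration (my example, not part of the original answer), here is how the interpolation argument changes the result for the data in the question; note that newer NumPy versions rename this parameter to method:

import numpy as np

value = [1, 2, 3, 4, 5, 6]
print(np.percentile(value, 25))                             # 2.25 (default 'linear')
print(np.percentile(value, 25, interpolation='lower'))      # 2.0, matches my_percentile above
print(np.percentile(value, 25, interpolation='midpoint'))   # 2.5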
(I don't think scipy.stats.percentileofscore() will work for you.)

Find nearest neighbors for arrays of different dimensions

I have to compute a similarity measure on several thousand uneven arrays.
The naive implementation is basically O(n²) and it's taking too long for the number of arrays I have.
Fortunately, I'm only interested in the similarity for the most similar arrays.
So far I have used the scikit-learn implementation of NearestNeighbors, which does the job for arrays with the same number of dimensions. However, NearestNeighbors is based on a KD-tree, and I don't think it's possible to apply this algorithm to uneven arrays.
Is there any alternative for arrays of different dimensions?
Here is a code snippet summarizing the problem:
import numpy as np
from sklearn.neighbors import NearestNeighbors

def partial_mse(a: np.array, b: np.array) -> float:
    def mse(a: np.array, b: np.array) -> float:
        mse = (np.square(a - b)).mean()
        return -np.sqrt(mse)
    if a.size == b.size:
        return mse(a, b)
    # a is always the bigger one
    if a.size < b.size:
        a, b = b, a
    partial_mse = [mse(a[i:i + b.size], b) for i in range(a.size - b.size + 1)]
    return np.max(partial_mse)

uneven_array = np.array([[1, 2, 3, 4], [3, 4], [3, 2, 6], [2, 1, 3], [3]])
even_array = np.array([[1, 2, 3, 4], [3, 2, 4, 1], [3, 2, 6, 1], [2, 6, 1, 3], [3, 5, 2, 0]])
nnfit = NearestNeighbors(n_neighbors=2, algorithm='auto', n_jobs=-1,
                         metric=partial_mse, metric_params={}).fit(uneven_array)

which fails with:

ValueError: setting an array element with a sequence.
NearestNeighbour algorithms are based on abstracting the arrays as n-dimensional points, so having points of different dimensions is going to throw the algorithm out of whack, and it possibly won't give you what you were looking for even if you managed to implement it.
If n is the maximum number of dimensions, then each lower-dimensional point with k dimensions actually corresponds to (n - k + 1) possible points in the higher-dimensional space (by filling the missing dimensions with elements of array a), and the metric you chose would return the maximum similarity out of those (n - k + 1) points (e.g. with n = 4 and k = 2 there are 3 possible alignments).
After several tries I found that:
Filling the missing entries with a default value is the only way to use NearestNeighbors and a KD-tree.
However, the default value contaminates the similarity function: the most similar part of the features will be the part sharing the same filling value.
I fixed it by adding the filling value as a parameter of partial_mse and filtering it out inside partial_mse. The filling value should be a value that doesn't exist in the arrays; otherwise it will filter out true values!
def partial_mse(a: np.array, b: np.array, **kwargs) -> float:
    [...]
    fill_value = kwargs["fill_value"]
    a, b = a[a != fill_value], b[b != fill_value]
    [...]

nnfit = NearestNeighbors(n_neighbors=10, algorithm='auto', n_jobs=-1,
                         metric=partial_mse, metric_params={"fill_value": fill_value}).fit(matrix_features)
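For reference, here is a minimal padding sketch (my own illustration, not from the original answer) showing how the uneven arrays could be brought to a common length with a sentinel fill_value before being passed to NearestNeighbors:

import numpy as np

# Pad every array to the longest length with a sentinel that never occurs in
# the real data, so that partial_mse can filter it back out.
fill_value = -9999.0
uneven = [np.array([1, 2, 3, 4]), np.array([3, 4]), np.array([3, 2, 6]),
          np.array([2, 1, 3]), np.array([3])]
max_len = max(a.size for a in uneven)
matrix_features = np.array([np.concatenate([a, np.full(max_len - a.size, fill_value)])
                            for a in uneven])
# matrix_features now has shape (5, 4) and can be used in the .fit() call above.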

Cosine similarity for very large dataset

I am having trouble calculating cosine similarity between a large list of 100-dimensional vectors. When I use from sklearn.metrics.pairwise import cosine_similarity, I get a MemoryError on my 16 GB machine. Each array fits perfectly in my memory, but I get the MemoryError during the internal np.dot() call.
Here's my use-case and how I am currently tackling it.
Here's my 100-dimensional parent vector, which I need to compare with 500,000 other vectors of the same dimension (i.e. 100):
parent_vector = [1, 2, 3, 4 ..., 100]
Here are my child vectors (with some made-up random numbers for this example)
child_vector_1 = [2, 3, 4, ....., 101]
child_vector_2 = [3, 4, 5, ....., 102]
child_vector_3 = [4, 5, 6, ....., 103]
.......
.......
child_vector_500000 = [3, 4, 5, ....., 103]
My final goal is to get the top-N child vectors (with their names, such as child_vector_1, and their corresponding cosine scores) that have very high cosine similarity with the parent vector.
My current approach (which I know is inefficient and memory-consuming):
Step 1: Create a super-dataframe of the following shape:
parent_vector 1, 2, 3, ....., 100
child_vector_1 2, 3, 4, ....., 101
child_vector_2 3, 4, 5, ....., 102
child_vector_3 4, 5, 6, ....., 103
......................................
child_vector_500000 3, 4, 5, ....., 103
Step 2: Use
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(df)
to get the pair-wise cosine similarity between all vectors (shown in the above dataframe).
Step 3: Make a list of tuples to store the key, such as child_vector_1, and the value, such as the cosine similarity number, for all such combinations.
Step 4: Get the top-N using sort() on the list, so that I get the child vector name as well as its cosine similarity score with the parent vector.
PS: I know this is highly inefficient, but I couldn't think of a better way to compute the cosine similarity between each child vector and the parent vector faster and get the top-N values.
Any help would be highly appreciated.
Even though your (500000, 100) array (the parent and its children) fits into memory, any pairwise metric on it won't. The reason is that a pairwise metric, as the name suggests, computes the distance between any two children. To store these distances you would need a (500000, 500000) array of floats, which at 8 bytes each works out to roughly 2 TB of memory.
Thankfully there is an easy solution for your problem. If I understand you correctly, you only want the distance between each child and the parent, which results in a vector of length 500000 that is easily stored in memory.
To do this, you simply need to provide a second argument to cosine_similarity containing only the parent_vector:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

df = pd.DataFrame(np.random.rand(500000, 100))
# Here I assume that the parent vector is stored as the first row in the
# dataframe, but you could also store it separately.
df['distances'] = cosine_similarity(df, df.iloc[0:1]).ravel()
n = 10  # or however many you want
n_largest = df['distances'].nlargest(n + 1)  # this contains the parent itself as the most similar entry, hence n+1 to get n children
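As a small follow-up (my addition, assuming the parent sits at row 0 as in the snippet above), you can drop the parent's own entry and read off the row labels and scores of the most similar children:

top_children = n_largest.iloc[1:]    # drop the parent's self-similarity entry
print(top_children.index.tolist())   # row labels of the most similar children
print(top_children.values)           # their cosine similarities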
Hope that solves your problem.
I couldn't even fit the entire corpus in memory, so a solution for me was to load it gradually and compute cosine similarity on smaller batches, always retaining the least/most n (depending on your use case) similar items:
import numpy as np
from scipy.sparse import csr_matrix

def get_bottom_k(corpus: list, k: int):
    pairwise_similarity = make_similarity_matrix(corpus)  # returns a sparse pairwise similarity matrix (defined elsewhere)
    sums = csr_matrix.sum(pairwise_similarity, axis=1)  # Similarity index for each item in corpus. Bigger > more similar to other texts.
    sums = np.squeeze(np.asarray(sums))
    indexes = np.argpartition(sums, k, axis=0)[:k]  # Bottom k in terms of similarity (-k for top and [-k:])
    return [corpus[i] for i in indexes]

data = []
iterations = 0
with open('/media/corpus.txt', 'r') as f:
    for line in f:
        data.append(line)
        if len(data) <= 1000:
            pass
        else:
            print('Getting bottom k, iteration {x}'.format(x=iterations))
            data = get_bottom_k(data, 500)
            iterations += 1
filtered = get_bottom_k(data, 500)  # final most different 500 texts in corpus
This is far from an optimal solution, but it's the easiest I found so far; maybe it will be of help to someone.
This solution is insanely fast:
child_vectors = np.array([child_vector_1, child_vector_2, ....., child_vector_500000])
# parent_vector is assumed to be a 2-D array of shape (1, 100)
input_norm = parent_vector / np.linalg.norm(parent_vector, axis=-1)[:, np.newaxis]
embed_norm = child_vectors / np.linalg.norm(child_vectors, axis=-1)[:, np.newaxis]
cosine_similarities = np.sort(np.round(np.dot(input_norm, embed_norm.T), 3)[0])[::-1]
pairwise_distances = 1 - cosine_similarities
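If you also need the indices (and hence the child names) of the top-N scores rather than just the sorted values, a short sketch along the same lines (my own addition, with made-up random data) could look like this:

import numpy as np

rng = np.random.default_rng(0)
parent_vector = rng.random((1, 100))        # assumed shape (1, 100)
child_vectors = rng.random((500_000, 100))  # assumed shape (500000, 100)

parent_norm = parent_vector / np.linalg.norm(parent_vector, axis=-1, keepdims=True)
child_norm = child_vectors / np.linalg.norm(child_vectors, axis=-1, keepdims=True)
scores = child_norm @ parent_norm.ravel()   # cosine similarity of each child with the parent

n = 10
top_idx = np.argpartition(scores, -n)[-n:]            # indices of the top-N, unordered
top_idx = top_idx[np.argsort(scores[top_idx])[::-1]]  # order them best-first
top_scores = scores[top_idx]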

scikit-learn custom transformer / pipeline that changes X and Y

I have a set of N data points X = {x1, ..., xn} and a set of N target values / classes Y = {y1, ..., yn}.
The feature vector for a given yi is constructed taking into account a "window" (for lack of a better term) of data points, e.g. I might want to stack "the last 4 data points", i.e. xi-4, xi-3, xi-2, xi-1 for prediction of yi.
Obviously for a window size of 4 such a feature vector cannot be constructed for the first three target values and I would like to simply drop them. Likewise for the last data point xn.
This would not be a problem, except I want this to take place as part of a sklearn pipeline. So far I have successfully written a few custom transformers for other tasks, but those cannot (as far as I know) change the Y matrix.
Is there a way to do this, that I am unaware of or am I stuck doing this as preprocessing outside of the pipeline? (Which means, I would not be able to use GridsearchCV to find the optimal window size and shift.)
I have tried searching for this, but all I came up with was this question, which deals with removing samples from the X matrix. The accepted answer there makes me think, what I want to do is not supported in scikit-learn, but I wanted to make sure.
You are correct, you cannot adjust your target within a sklearn Pipeline. That doesn't mean you cannot do a grid search, but it does mean you may have to go about it in a bit more of a manual fashion. I would recommend writing a function to do your transformations and filtering on y, and then manually looping through a tuning grid created via ParameterGrid. If this doesn't make sense to you, edit your post with the code you have for further assistance.
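Here is a minimal sketch of that manual loop (my own illustration with made-up data; make_windows is a hypothetical helper, not sklearn API):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import ParameterGrid

def make_windows(X, y, window):
    # Hypothetical helper: stack the previous `window` data points as the
    # feature vector for y[i] and drop the targets that cannot be built.
    Xw = np.array([X[i - window:i].ravel() for i in range(window, len(X))])
    yw = np.asarray(y[window:])
    return Xw, yw

rng = np.random.default_rng(0)
X, y = rng.random((200, 3)), rng.random(200)  # made-up data
X_train, y_train, X_val, y_val = X[:150], y[:150], X[150:], y[150:]

best = None
for params in ParameterGrid({"window": [2, 3, 4]}):
    Xw, yw = make_windows(X_train, y_train, params["window"])
    Xv, yv = make_windows(X_val, y_val, params["window"])
    score = mean_squared_error(yv, Ridge().fit(Xw, yw).predict(Xv))
    if best is None or score < best[0]:
        best = (score, params)
print(best)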
I am struggling with a similar issue and find it unfortunate that you cannot pass on the y-values between transformers. That being said, I bypassed the issue in a bit of a dirty way.
I am storing the y-values as an instance attribute of the transformers. That way I can access them in the transform method when the pipeline calls fit_transform. Then, the transform method passes on a tuple (X, self.y_stored) which is expected by the next estimator. This means I have to write wrapper estimators and it's very ugly, but it works!
Something like this:
class MyWrapperEstimator(RealEstimator):
    def fit(self, X, y=None):
        if isinstance(X, tuple):
            X, y = X
        super().fit(X=X, y=y)
For your specific example of stacking the last 4 data points, you might be able to use seglearn.
>>> import numpy as np
>>> import seglearn
>>> x = np.arange(10)[None,:]
>>> x
array([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
>>> y = x
>>> new_x, new_y, _ = seglearn.transform.SegmentXY(width=4, overlap=0.75).fit_transform(x, y)
>>> new_x
array([[0, 1, 2, 3],
[1, 2, 3, 4],
[2, 3, 4, 5],
[3, 4, 5, 6],
[4, 5, 6, 7],
[5, 6, 7, 8],
[6, 7, 8, 9]])
>>> new_y
array([3, 4, 5, 6, 7, 8, 9])
seglearn claims to be scikit-learn-compatible, so you should be able to fit SegmentXY in the beginning of a scikit-learn pipeline. However, I have not tried it in a pipeline myself.

numpy.polyfit gives empty residuals array

I use numpy.polyfit to fit a 2nd-order polynomial to a set of data:
fit1, fit_err1, _, _, _ = np.polyfit(xint[:index_max],
                                     yint[:index_max],
                                     2,
                                     full=True)
For a few examples of my data, the variable fit_err1 is empty although the fit was successful, i.e. fit1 is not empty!
Does anybody know what an empty residual means in this context? Thank you!
EDIT:
one example data set:
x = [-488., -478., -473.]
y = [ 0.02080881, 0.03233648, 0.03584448]
fit1, fit_err1, _, _, _ = np.polyfit(x, y, 2, full=True)
result:
fit1 = [ -3.00778818e-05 -2.79024663e-02 -6.43272769e+00]
fit_err1 = []
I know that fitting a 2nd-order polynomial to a set of three points is not very useful, but I still expect the function to either raise a warning, or (as it actually determined a fit) return the actual residuals, or both (like "here are the residuals, but your conditions are poor!").
As pointed out by @Jaime, if you have three points a second-order polynomial will fit them exactly. Your point that the error should rather be 0 than an empty array makes sense, but this is the current behavior of np.linalg.lstsq, which np.polyfit wraps.
We can test this behavior by doing the least-squares fit of the equation y = a*x**0 + b*x**1 + c*x**2, for which we know the answer should be a=0, b=0, c=1:
np.linalg.lstsq([[1, 1, 1], [1, 2, 4], [1, 3, 9]], [1, 4, 9])
#(array([ -3.43396424e-15, 3.88578059e-15, 1.00000000e+00]),
# array([], dtype=float64),
# 3,
# array([ 10.64956309, 1.2507034 , 0.15015641]))
where we can see that the second output is an empty array. This is the intended behavior.
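If you want an explicit residual value even in this exactly determined case, one option (my addition, not from the original answer) is to compute the sum of squared residuals yourself with np.polyval:

import numpy as np

x = np.array([-488., -478., -473.])
y = np.array([0.02080881, 0.03233648, 0.03584448])
fit1 = np.polyfit(x, y, 2)

# Sum of squared residuals computed directly; for three points and a
# 2nd-order polynomial this is ~0 up to floating-point error.
residuals = np.sum((np.polyval(fit1, x) - y) ** 2)
print(residuals)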
