On my way through learning ML stuff I am confused by the MinMaxScaler provided by sklearn. The goal is to normalize numerical data into a range of [0, 1].
Example code:
from sklearn.preprocessing import MinMaxScaler
data = [[1, 2], [3, 4], [4, 5]]
scaler = MinMaxScaler(feature_range=(0, 1))
scaledData = scaler.fit_transform(data)
Giving output:
[[0. 0. ]
[0.66666667 0.66666667]
[1. 1. ]]
The first array [1, 2] got transformed into [0, 0] which in my eyes means:
The ratio between the numbers is gone
Neither value has any importance anymore, as both got set to the minimum value (0).
Example of what I have expected:
[[0.1, 0.2]
[0.3, 0.4]
[0.4, 0.5]]
This would have saved the ratios and put the numbers into the range of 0 to 1.
What am I doing wrong or misunderstanding about MinMaxScaler here? Thinking of things like training on time series, it seems to make no sense to transform important numbers like prices or temperatures into broken values like the ones above.
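MinMaxScaler rescales each feature (column) independently into the given range, using the following formula from the documentation. So your issue comes down to the formula used: the minimum of every column maps to 0 and the maximum to 1, which is why the first row becomes [0, 0].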
Formula:
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min
Let us try and see what happens when you use it on your data.
You need to use numpy for this.
import numpy as np
data = np.array([[1, 2], [3, 4], [4, 5]])
# the target range comes from the feature_range you specify
# (named range_min / range_max to avoid shadowing the built-ins min and max)
range_min = 0
range_max = 1
# per-column minimum and maximum (axis=0)
col_min = data.min(axis=0)
col_max = data.max(axis=0)
X_std = (data - col_min) / (col_max - col_min)
X_scaled = X_std * (range_max - range_min) + range_min
This returns as expected:
array([[0. , 0. ],
[0.66666667, 0.66666667],
[1. , 1. ]])
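To double-check the point above, here is a small sketch that compares the scaler with the manual per-column formula and then undoes the scaling with inverse_transform. It only uses the data from the question; the variable names are mine.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = np.array([[1, 2], [3, 4], [4, 5]], dtype=float)

scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(data)

# Same formula by hand: every column is shifted and scaled on its own.
col_min = data.min(axis=0)
col_max = data.max(axis=0)
manual = (data - col_min) / (col_max - col_min)

# The scaler remembers the per-column min and max, so nothing is lost.
restored = scaler.inverse_transform(scaled)

print(np.allclose(scaled, manual))
print(np.allclose(restored, data))
```

So the first row becoming [0, 0] does not throw information away: it just says that row holds the smallest value of each column, and the original numbers can be recovered from the fitted scaler.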
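As for your doubts about using MinMaxScaler: you could use StandardScaler if you have outliers that are quite different from most of the values but are still valid data.
StandardScaler is used the same way as MinMaxScaler, but it scales your values so that they have a mean of 0 and a standard deviation of 1. Since those statistics are computed from all the values in the series, a single extreme value does not pin the rest of the data to one end of the range the way it does with MinMaxScaler. The mean and standard deviation are still affected by outliers, though; RobustScaler, which uses the median and the interquartile range, is the more outlier-resistant choice.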
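If you want to see how the scalers react to an outlier, a quick comparison like the one below may help. The data is made up for illustration (one feature with a single extreme value); it is a sketch, not a benchmark.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical single feature with one extreme but valid value.
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

minmax = MinMaxScaler().fit_transform(values)
standard = StandardScaler().fit_transform(values)
robust = RobustScaler().fit_transform(values)

for name, scaled in [("MinMaxScaler", minmax),
                     ("StandardScaler", standard),
                     ("RobustScaler", robust)]:
    # Print each scaled column as a flat, rounded vector for easy comparison.
    print(name, np.round(scaled.ravel(), 3))
```

With MinMaxScaler the four ordinary values end up squeezed close to 0 because the outlier defines the maximum. RobustScaler centers on the median and divides by the interquartile range, so the ordinary values keep a usable spread.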
When doing a KMeans cluster analysis, how do I manually define certain cluster centers?
For example I want to say my cluster centers are [1,2,3] and [3,4,5] and now I want to cluster my vectors to the predefined centers.
something like kmeans.cluster_centers_ = [[1,2,3],[3,4,5]] ?
To work around my problem, this is what I do at the moment:
number_of_clusters = len(vec)
kmeans = KMeans(number_of_clusters, init='k-means++', n_init=100)
kmeans.fit(vec)
It basically defines a cluster for each vector, but it takes ages to compute as I have thousands of vectors/sentences. There must be an option to set the vector coordinates directly as cluster centers without having to compute them with the k-means algorithm (the center outputs are basically the vector coordinates after I run the algorithm anyway).
Edit to be more specific about my task:
I have tons of vectors (generated from sentences) and I want to cluster them. But imagine I have two columns of sentences and always want to match a B column sentence to an A column sentence, never A column sentences to each other. That's why I want to set cluster centers from the A column vectors and afterwards predict the closest B vectors to these centers. Hope that makes sense.
I am using sklearn KMeans at the moment.
I think I know what you want to do. You want to manually select the centroids for k-Means from some known examples and then perform the clustering to assign the closest data points to your pre-defined centroids.
The parameter you are looking for is the k-Means initialization, named init (see the documentation).
I have prepared a small example that would do exactly this.
import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial import distance_matrix
# 5 datapoints with 3 features
data = [[1, 0, 0],
[1, 0.2, 0],
[0, 0, 1],
[0, 0, 0.9],
[1, 0, 0.1]]
X = np.array(data)
distance_matrix(X,X)
The pairwise distance matrix shows which examples are closest to each other.
> array([[0. , 0.2 , 1.41421356, 1.3453624 , 0.1 ],
> [0.2 , 0. , 1.42828569, 1.36014705, 0.2236068 ],
> [1.41421356, 1.42828569, 0. , 0.1 , 1.3453624 ],
> [1.3453624 , 1.36014705, 0.1 , 0. , 1.28062485],
> [0.1 , 0.2236068 , 1.3453624 , 1.28062485, 0. ]])
You can select certain data points to be used as your initial centroids:
centroid_idx = [0,2] # let data point 0 and 2 be our centroids
centroids = X[centroid_idx,:]
print(centroids) # [[1. 0. 0.]
# [0. 0. 1.]]
kmeans = KMeans(n_clusters=2, init=centroids, n_init=1, max_iter=1) # explicit initial centers need only one init; a single k-Means iteration keeps the centroids close to where they started
kmeans.fit(X)
kmeans.labels_
>>> array([0, 0, 1, 1, 0], dtype=int32)
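As you can see, k-Means labels the data points as expected. Note that even a single iteration recomputes each centroid as the mean of its assigned points, so kmeans.cluster_centers_ can differ slightly from the centroids you passed in. Omit the max_iter parameter if you want the centroids to keep updating until convergence.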
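If the centers really must stay fixed (the A vectors) and you only want to assign other vectors (the B vectors) to the nearest one, you may not need KMeans at all. Here is a minimal sketch using pairwise_distances_argmin from sklearn.metrics; the arrays are small made-up stand-ins for your sentence vectors.

```python
import numpy as np
from sklearn.metrics import pairwise_distances_argmin

# Hypothetical "A column" vectors used as fixed cluster centers.
centers = np.array([[1.0, 0.0, 0.0],
                    [0.0, 0.0, 1.0]])

# Hypothetical "B column" vectors to be assigned to the nearest center.
points = np.array([[1.0, 0.2, 0.0],
                   [0.0, 0.0, 0.9],
                   [1.0, 0.0, 0.1]])

# For each row of points, the index of the closest row of centers
# (Euclidean distance by default).
labels = pairwise_distances_argmin(points, centers)
print(labels)  # [0 1 0]
```

Nothing is fitted here, so the centers never move and the cost is just one distance computation between the two sets.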
Here is the code.
from sklearn.neighbors import NearestNeighbors
import numpy as np
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X)
distances, indices = nbrs.kneighbors(X)
>indices
>array([[0, 1],[1, 0],[2, 1],[3, 4],[4, 3],[5, 4]])
>distances
>array([[0. , 1. ],[0. , 1. ],[0. , 1.41421356], [0. , 1. ],[0. , 1. ],[0. , 1.41421356]])
I don't really understand the shape of 'indices' and 'distances'. How do I understand what these numbers mean?
It's pretty straightforward, actually. For each data sample in the input to kneighbors() (X here), it returns 2 neighbors (because you specified n_neighbors=2). indices gives you the index of each neighbor in the training data (again X here), and distances gives you the distance to that neighbor (the one the corresponding entry of indices refers to).
Take an example of single data point. Assuming X[0] as the first query point, the answer will be indices[0] and distances[0]
So for X[0],
the index of first nearest neighbor in training data is indices[0, 0] = 0 and distance is distances[0, 0] = 0. You can use this index value to get the actual data sample from the training data.
This makes sense, because you used the same data for training and testing, so the first nearest neighbor for each point is itself and the distance is 0.
the index of the second nearest neighbor is indices[0, 1] = 1 and the distance is distances[0, 1] = 1
The same holds for all other points. The first dimension of indices and distances corresponds to the query points and the second dimension to the number of neighbors asked for.
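The shapes may be easier to see when the query points differ from the training data. In this sketch the two query points are made up for illustration; everything else comes from the question.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X)

# Hypothetical query points that are not part of the training data.
queries = np.array([[-1.2, -1.0], [2.9, 2.0]])
distances, indices = nbrs.kneighbors(queries)

print(indices.shape)    # (2, 2): one row per query, one column per neighbor
print(indices)          # [[0 1]
                        #  [5 4]]
print(X[indices[0]])    # the two training points closest to the first query
```

Because the queries are no longer training points, the first column is no longer "the point itself at distance 0"; it is simply the closest training sample.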
Maybe a little sketch will help
As an example, the closest point to the training sample with index 0 is 1, and since you are using n_neighbors = 2 (two neighbors) you would expect to see this pair in the results. And indeed you see that the pair [0, 1] appears in the output.
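To add to the above, here is how you can collect the n_neighbors=2 neighbors of every point into a pandas DataFrame using the indices array. This assumes X is a pandas DataFrame, since it relies on X.iloc:
import pandas as pd
# one output row per (query point, neighbor) pair, in row-major order
df = pd.DataFrame([X.iloc[indices[row,col]] for row in range(indices.shape[0]) for col in range(indices.shape[1])])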
Right now I have a 2 by 2 numpy array. RobustScaler normalizes each column one at a time, whereas I wish to normalize everything at once. Is there any way to do that?
From the documentation, RobustScaler:
removes the median and scales the data according to the quantile range
So you need to compute the median and the quantile range for the whole array. For this you can use the np.median and np.percentile functions, which is what sklearn does under the hood. The code:
import numpy as np
from sklearn.preprocessing import robust_scale
data = np.array([[3, 6],
[9, 12]], dtype=np.float64)
# default behaviour: each column is scaled on its own
result = robust_scale(data, axis=0)
print(result)
# option 1: flatten to a single row and scale along that row
reshape = data.reshape((1, 4))
result = robust_scale(reshape, axis=1)
# option 2: do it by hand with the global median and quantile range
me = np.median(data) # 7.5
percentiles = np.percentile(data, (25.0, 75.0)) # 5.25 9.75
data -= me
data /= (percentiles[1] - percentiles[0])
print(data)
Output
[[-1. -1.]
[ 1. 1.]]
[[-1. -0.33333333]
[ 0.33333333 1. ]]
In the example I used (25.0, 75.0) because these are the default values for the quantile range. Also, the function robust_scale is equivalent to RobustScaler (see the See Also section of the documentation).
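If you need this more than once, the manual steps can be wrapped in a small helper. robust_scale_global is a name I made up, not part of sklearn; the sketch also checks the result against RobustScaler applied to the array reshaped into a single column.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler


def robust_scale_global(a, quantile_range=(25.0, 75.0)):
    """Scale the whole array with one global median and quantile range."""
    a = np.asarray(a, dtype=float)
    median = np.median(a)
    q_low, q_high = np.percentile(a, quantile_range)
    return (a - median) / (q_high - q_low)


data = np.array([[3, 6],
                 [9, 12]], dtype=np.float64)

by_hand = robust_scale_global(data)

# Same idea with RobustScaler: treat all values as one feature column,
# scale, then restore the original shape.
via_sklearn = RobustScaler().fit_transform(data.reshape(-1, 1)).reshape(data.shape)

print(by_hand)
print(np.allclose(by_hand, via_sklearn))
```

Unlike the in-place version above, the helper returns a new array and leaves data untouched.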
I am seeing something strange while using AffinityPropagation from sklearn. I have a 4 x 4 numpy ndarray, which is basically the affinity scores: sim[i, j] holds the affinity score of [i, j]. Now, when I feed it into the AffinityPropagation function, I get a total of 4 labels.
Here is a similar example with a smaller matrix:
In [215]: x = np.array([[1, 0.2, 0.4, 0], [0.2, 1, 0.8, 0.3], [0.4, 0.8, 1, 0.7], [0, 0.3, 0.7, 1]]
.....: )
In [216]: x
Out[216]:
array([[ 1. , 0.2, 0.4, 0. ],
[ 0.2, 1. , 0.8, 0.3],
[ 0.4, 0.8, 1. , 0.7],
[ 0. , 0.3, 0.7, 1. ]])
In [217]: clusterer = cluster.AffinityPropagation(affinity='precomputed')
In [218]: f = clusterer.fit(x)
In [219]: f.labels_
Out[219]: array([0, 1, 1, 1])
This says (according to Kevin), that the first sample (0th-indexed row) is a cluster (Cluster # 0) on its own and the rest of the samples are in another cluster (cluster # 1). But, still, I do not understand this output. What is a sample here? What are the members? I want to have a set of pairs (i, j) assigned to one cluster, another set of pairs assigned to another cluster and so on.
It looks like a 4-sample x 4-feature matrix, which I do not want. Is this the problem? If so, how do I convert this to a nice 4-sample x 4-sample affinity matrix?
The documentation (http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AffinityPropagation.html) says
fit(X, y=None)
Create affinity matrix from negative euclidean distances, then apply affinity propagation clustering.
Parameters:
X: array-like, shape (n_samples, n_features) or (n_samples, n_samples) :
Data matrix or, if affinity is precomputed, matrix of similarities / affinities.
Thanks!
By your description it sounds like you are working with a "pairwise similarity matrix" x (although your example data does not show that). If this is the case, your matrix should be symmetric, so that sim[i,j] == sim[j,i], with your diagonal values equal to 1. Example similarity data S:
S
array([[ 1. , 0.08276253, 0.16227766, 0.47213595, 0.64575131],
[ 0.08276253, 1. , 0.56776436, 0.74456265, 0.09901951],
[ 0.16227766, 0.56776436, 1. , 0.47722558, 0.58257569],
[ 0.47213595, 0.74456265, 0.47722558, 1. , 0.87298335],
[ 0.64575131, 0.09901951, 0.58257569, 0.87298335, 1. ]])
Typically when you already have a distance matrix you should use affinity='precomputed'. But in your case, you are using similarity. In this specific example you can transform it to a pseudo-distance using 1 - S. (The reason to do this is that I don't know whether Affinity Propagation will give you the expected results if you give it a similarity matrix as input):
1 - S
array([[ 0. , 0.91723747, 0.83772234, 0.52786405, 0.35424869],
[ 0.91723747, 0. , 0.43223564, 0.25543735, 0.90098049],
[ 0.83772234, 0.43223564, 0. , 0.52277442, 0.41742431],
[ 0.52786405, 0.25543735, 0.52277442, 0. , 0.12701665],
[ 0.35424869, 0.90098049, 0.41742431, 0.12701665, 0. ]])
With that being said, I think this is where your interpretation was off:
This says that the first 3-rows are similar, 4th row is a cluster on its own, and the 5th row is also a cluster on its own. Totally of 3 clusters.
The f.labels_ array:
array([0, 1, 1, 1, 0])
is telling you that samples (not rows) 0 and 4 are in cluster 0 AND that samples 1, 2, and 3 are in cluster 1. You don't need 25 different labels for a 5-sample problem; that wouldn't make sense. Hope this helps a little. Try the demo (inspect the variables along the way and compare them with your data), which starts with raw data; it should help you decide if Affinity Propagation is the right clustering algorithm for you.
According to this page https://scikit-learn.org/stable/modules/clustering.html
you can use a similarity matrix for AffinityPropagation.
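This matches the fit() documentation quoted in the question, which says X is a matrix of similarities / affinities when affinity is precomputed. Here is a small sketch of passing the question's matrix directly and listing which samples share a label. random_state=0 is my addition for repeatability, and the labels you get may depend on your scikit-learn version, so I am not showing an expected output.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

# Similarity matrix from the question: x[i, j] is the affinity of samples i and j.
x = np.array([[1.0, 0.2, 0.4, 0.0],
              [0.2, 1.0, 0.8, 0.3],
              [0.4, 0.8, 1.0, 0.7],
              [0.0, 0.3, 0.7, 1.0]])

clusterer = AffinityPropagation(affinity='precomputed', random_state=0)
labels = clusterer.fit(x).labels_

# One label per sample (row/column index), not one label per (i, j) pair.
clusters = {int(label): np.flatnonzero(labels == label).tolist()
            for label in np.unique(labels)}
print(labels)
print(clusters)
```

Each key of clusters is a cluster label and each value is the list of sample indices in it, which is usually the easiest way to read labels_.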
I can run the simple pykalman Kalman Filter example given in the pykalman documentation:
import pykalman
import numpy as np
kf = pykalman.KalmanFilter(transition_matrices = [[1, 1], [0, 1]], observation_matrices = [[0.1, 0.5], [-0.3, 0.0]])
measurements = np.asarray([[1,0], [0,0], [0,1]]) # 3 observations
(filtered_state_means, filtered_state_covariances) = kf.filter(measurements)
print(filtered_state_means)
This correctly returns the state estimates (one for each observation):
[[ 0.07285974 0.39708561]
[ 0.30309693 0.2328318 ]
[-0.5533711 -0.0415223 ]]
However, if I provide only a single observation, the code fails:
import pykalman
import numpy as np
kf = pykalman.KalmanFilter(transition_matrices = [[1, 1], [0, 1]], observation_matrices = [[0.1, 0.5], [-0.3, 0.0]])
measurements = np.asarray([[1,0]]) # 1 observation
(filtered_state_means, filtered_state_covariances) = kf.filter(measurements)
print(filtered_state_means)
with the following error:
ValueError: could not broadcast input array from shape (2,2) into shape (2,1)
How can I use pykalman to update an initial state and initial covariance using just a single observation?
From the documentation at: http://pykalman.github.io/#kalmanfilter
filter_update(filtered_state_mean, filtered_state_covariance, observation=None, transition_matrix=None, transition_offset=None, transition_covariance=None, observation_matrix=None, observation_offset=None, observation_covariance=None)
This takes in the filtered_state_mean and filtered_state_covariance at time t, and an observation at t+1, and returns the state mean and state covariance at t+1 (to be used for the next update)
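To make it clearer what such a single update does, here is a plain NumPy sketch of one textbook predict-then-correct Kalman step. It is not pykalman's implementation: kalman_update is a hypothetical helper, and the noise covariances Q and R and the initial mean and covariance are assumptions chosen for illustration (the question does not set them). The transition and observation matrices are the ones from the question.

```python
import numpy as np


def kalman_update(mean, cov, observation, F, H, Q, R):
    """One predict + correct step of a linear Kalman filter (textbook form)."""
    # Predict the state at t+1 from the estimate at t.
    pred_mean = F @ mean
    pred_cov = F @ cov @ F.T + Q
    # Correct the prediction with the observation at t+1.
    innovation = observation - H @ pred_mean
    S = H @ pred_cov @ H.T + R               # innovation covariance
    K = pred_cov @ H.T @ np.linalg.inv(S)    # Kalman gain
    new_mean = pred_mean + K @ innovation
    new_cov = pred_cov - K @ H @ pred_cov
    return new_mean, new_cov


F = np.array([[1.0, 1.0], [0.0, 1.0]])     # transition matrix from the question
H = np.array([[0.1, 0.5], [-0.3, 0.0]])    # observation matrix from the question
Q = np.eye(2)                              # assumed transition noise covariance
R = np.eye(2)                              # assumed observation noise covariance

mean0 = np.zeros(2)                        # assumed initial state mean
cov0 = np.eye(2)                           # assumed initial state covariance

mean1, cov1 = kalman_update(mean0, cov0, np.array([1.0, 0.0]), F, H, Q, R)
print(mean1)
print(cov1)
```

The returned mean and covariance are what you would feed into the next call when the next observation arrives, which is the same loop the filter_update signature above is designed for.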
If I understand the Kalman filter algorithm correctly, you can predict the state using just one observation, but the gain and the covariance would be way off and the prediction would be nowhere close to the actual state.
You need to give a Kalman filter a few observations as a training set for it to reach a steady state.