I have the following distance matrix based on 10 datapoints:
import numpy as np
distance_matrix = np.array([[0. , 0.00981376, 0.0698306 , 0.01313118, 0.05344448,
0.0085152 , 0.01996724, 0.14019663, 0.03702411, 0.07054652],
[0.00981376, 0. , 0.06148157, 0.00563764, 0.04473798,
0.00905327, 0.01223233, 0.13140022, 0.03114453, 0.06215728],
[0.0698306 , 0.06148157, 0. , 0.05693448, 0.02083512,
0.06390897, 0.05107812, 0.07539802, 0.04003773, 0.00703263],
[0.01313118, 0.00563764, 0.05693448, 0. , 0.0408836 ,
0.00787845, 0.00799949, 0.12779965, 0.02552774, 0.05766039],
[0.05344448, 0.04473798, 0.02083512, 0.0408836 , 0. ,
0.04846382, 0.03638932, 0.0869414 , 0.03579818, 0.0192329 ],
[0.0085152 , 0.00905327, 0.06390897, 0.00787845, 0.04846382,
0. , 0.01284173, 0.13540522, 0.03010677, 0.0646998 ],
[0.01996724, 0.01223233, 0.05107812, 0.00799949, 0.03638932,
0.01284173, 0. , 0.12310601, 0.01916205, 0.05188323],
[0.14019663, 0.13140022, 0.07539802, 0.12779965, 0.0869414 ,
0.13540522, 0.12310601, 0. , 0.11271352, 0.07346808],
[0.03702411, 0.03114453, 0.04003773, 0.02552774, 0.03579818,
0.03010677, 0.01916205, 0.11271352, 0. , 0.04157886],
[0.07054652, 0.06215728, 0.00703263, 0.05766039, 0.0192329 ,
0.0646998 , 0.05188323, 0.07346808, 0.04157886, 0. ]])
I transform the distance_matrix into an affinity_matrix using a Gaussian (RBF) kernel:
delta = 0.1
np.exp(- distance_matrix ** 2 / (2. * delta ** 2))
which gives:
affinity_matrix = np.array([[1. , 0.99519608, 0.7836321 , 0.99141566, 0.86691389,
0.99638113, 0.98026285, 0.37427863, 0.93375682, 0.77970427],
[0.99519608, 1. , 0.82778719, 0.99841211, 0.90477015,
0.9959103 , 0.99254642, 0.42176757, 0.95265821, 0.82433657],
[0.7836321 , 0.82778719, 1. , 0.85037594, 0.97852875,
0.81528476, 0.8777015 , 0.75258369, 0.92297697, 0.99753016],
[0.99141566, 0.99841211, 0.85037594, 1. , 0.91982353,
0.99690131, 0.99680552, 0.44191509, 0.96794184, 0.84684633],
[0.86691389, 0.90477015, 0.97852875, 0.91982353, 1. ,
0.88919645, 0.93593511, 0.68527137, 0.9379342 , 0.98167476],
[0.99638113, 0.9959103 , 0.81528476, 0.99690131, 0.88919645,
1. , 0.9917884 , 0.39982486, 0.95569077, 0.81114925],
[0.98026285, 0.99254642, 0.8777015 , 0.99680552, 0.93593511,
0.9917884 , 1. , 0.46871776, 0.9818083 , 0.87407117],
[0.37427863, 0.42176757, 0.75258369, 0.44191509, 0.68527137,
0.39982486, 0.46871776, 1. , 0.52982057, 0.76347268],
[0.93375682, 0.95265821, 0.92297697, 0.96794184, 0.9379342 ,
0.95569077, 0.9818083 , 0.52982057, 1. , 0.91719051],
[0.77970427, 0.82433657, 0.99753016, 0.84684633, 0.98167476,
0.81114925, 0.87407117, 0.76347268, 0.91719051, 1. ]])
I plot the distance_matrix as a heatmap to get a better visual of the data:
import pandas as pd
import seaborn as sns
distance_matrix_df = pd.DataFrame(distance_matrix)
distance_matrix_df.columns = [x + 1 for x in range(10)]
distance_matrix_df.index = [x + 1 for x in range(10)]
sns.heatmap(distance_matrix_df, cmap='RdYlGn_r', annot=True, linewidths=0.5)
Next I want to cluster the affinity_matrix into 3 clusters. Before running the actual clustering, I inspect the heatmap to predict the clusters. Clearly #8 is an outlier and should be a cluster on its own.
Next I run the actual clustering.
from sklearn.cluster import SpectralClustering
clustering = SpectralClustering(n_clusters=3,
assign_labels='kmeans',
affinity='precomputed').fit(affinity_matrix)
clusters = clustering.labels_.copy()
clusters = clusters.astype(np.int32) + 1
The output is
[1, 1, 2, 1, 2, 1, 1, 2, 3, 2]
So #8 is part of cluster 2, which contains three other data points. I had assumed it would be a cluster on its own. Did I do something wrong? Or can someone show me why #8 is grouped with #3, #5 and #10? Please advise.
When we move away from relatively simple clustering algorithms such as k-means, whatever intuition we carry along about algorithm results and expected behavior breaks down; indeed, the scikit-learn documentation on spectral clustering gives an implicit warning about that:
Apply clustering to a projection of the normalized Laplacian.
In practice Spectral Clustering is very useful when the structure of
the individual clusters is highly non-convex or more generally when a
measure of the center and spread of the cluster is not a suitable
description of the complete cluster. For instance when clusters are
nested circles on the 2D plane.
Now, even if one pretends to understand exactly what "a projection of the normalized Laplacian" means (I won't), the rest of the description arguably makes it clear enough that we should not expect results similar to those of more intuitive, distance-based clustering algorithms like k-means.
Nevertheless, your own intuition is not unfounded, and it shows if you just try k-means clustering instead of spectral clustering; using your exact data (with each row of the affinity matrix treated as a feature vector), we get
from sklearn.cluster import KMeans
clustering = KMeans(n_clusters=3, random_state=42).fit(affinity_matrix)
clusters = clustering.labels_.copy()
clusters = clusters.astype(np.int32) + 1
clusters
# result:
array([2, 2, 1, 2, 1, 2, 2, 3, 2, 1], dtype=int32)
where indeed sample #8 stands out as an outlier in a cluster of its own (cluster 3).
However, the same intuition is not necessarily applicable or useful with other clustering algorithms, whose value is arguably exactly that they can uncover regularities of different kinds in the data; they would not be that useful if they just replicated the results of existing algorithms like k-means, would they?
The scikit-learn example "Comparing different clustering algorithms on toy datasets" might be useful to get an idea of how different clustering algorithms behave on some toy 2D datasets.
[Figure: summary plot from that example]
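For readers who want a rough feel for what "a projection of the normalized Laplacian" involves, here is a minimal NumPy sketch on a hypothetical 4-point affinity matrix (not the data from the question). It is a simplified illustration under those assumptions, not scikit-learn's actual implementation, which differs in solver and normalization details and, with assign_labels='kmeans', finishes by running k-means on the embedding.

```python
import numpy as np

# Hypothetical affinity matrix: points 0-1 and points 2-3 form two tight groups.
A = np.array([[1.0, 0.9, 0.1, 0.1],
              [0.9, 1.0, 0.1, 0.1],
              [0.1, 0.1, 1.0, 0.9],
              [0.1, 0.1, 0.9, 1.0]])

# Degree of each point = sum of its affinities.
degrees = A.sum(axis=1)
d_inv_sqrt = 1.0 / np.sqrt(degrees)

# Symmetrically normalized affinity M = D^(-1/2) A D^(-1/2);
# the normalized Laplacian is I - M, so both share the same eigenvectors.
M = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

# eigh returns eigenvalues in ascending order for symmetric matrices.
eigvals, eigvecs = np.linalg.eigh(M)

# The "projection": keep the eigenvectors with the largest eigenvalues of M
# (equivalently, the smallest eigenvalues of the Laplacian).
embedding = eigvecs[:, -2:]

# For two groups, the sign of the second-largest eigenvector already separates them.
groups = (eigvecs[:, -2] > 0).astype(int)
print(groups)
```

The clustering step then operates on the rows of `embedding` rather than on the original distances, which is one reason the result can disagree with what the heatmap suggests.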
I'm trying to do deep learning on time-series data.
There are 12 features for each data point, but the series do not all have the same length.
Some have shape [48, 12] and some [54, 12]; I'm trying to resize them all to [50, 12].
All I know so far is resize in skimage.transform, but I don't know whether it works well for this.
Is there any other solution for doing this?
For example, one of the features in the data looks like the following.
The shape is [55, 1] and I would like to resize it to [50, 1].
a = np.array([-5.529309, -4.6293, -3.068647, -4.897388, -4.39951, -4.753769, -3.729291,
-4.973984, -5.060155, -4.686748, -4.696322, -3.939932, -3.470778, -6.209103,
-5.586756, -4.466532, -3.193116, -5.337818, -5.596331, -4.006954, -3.499502,
-3.413331, -6.304848, -4.322914, -4.246317, -5.759098, -5.893142, -6.381444,
-4.52398, -4.198445, -5.634629, -6.276124, -5.17505, -4.322914, -4.198445,
-4.600576, -4.39951, -4.945261, -5.759098, -4.677173, -3.623971, -5.692076,
-6.563361, -5.462287, -4.868664, -5.941015, -6.400594, -5.692076, -4.591002,
-6.027186, -5.960164, -6.256975, -5.414414, -5.730374, -6.726129])
If I use resize, the data looks like this:
[Figure: the series before and after resize]
One option is TimeSeriesResampler in tslearn. It resizes a given time series to a fixed size that you specify, resampling the data by (linear) interpolation.
https://tslearn.readthedocs.io/en/stable/gen_modules/preprocessing/tslearn.preprocessing.TimeSeriesResampler.html
Example:
import numpy as np
from tslearn.preprocessing import TimeSeriesResampler
ts = np.arange(5)
new_ts = TimeSeriesResampler(sz=9).fit_transform(ts)
final_ts = np.squeeze(new_ts)
print(ts) # [0 1 2 3 4]
print(new_ts) # [[[0. ] [0.5] [1. ] [1.5] [2. ] [2.5] [3. ] [3.5] [4. ]]]
print(final_ts) # [0. 0.5 1. 1.5 2. 2.5 3. 3.5 4.]
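If adding tslearn as a dependency is not an option, the same kind of linear resampling can be sketched with plain NumPy. The helper name resample_series below is hypothetical, and the sketch assumes each series is a 2D array of shape [length, n_features] whose samples are evenly spaced in time.

```python
import numpy as np

def resample_series(series, target_len):
    """Linearly resample a [length, n_features] array to [target_len, n_features]."""
    series = np.asarray(series, dtype=float)
    # Place the old and new samples on a common 0..1 axis so the endpoints line up.
    old_x = np.linspace(0.0, 1.0, num=series.shape[0])
    new_x = np.linspace(0.0, 1.0, num=target_len)
    # np.interp works on 1D data, so interpolate each feature column separately.
    columns = [np.interp(new_x, old_x, series[:, j]) for j in range(series.shape[1])]
    return np.column_stack(columns)

# Example with stand-in data: a [55, 12] series resized to [50, 12].
rng = np.random.default_rng(0)
example = rng.normal(size=(55, 12))
resized = resample_series(example, 50)
print(resized.shape)
```

On the small example from the answer above, resample_series(np.arange(5).reshape(-1, 1), 9) gives the same values as the tslearn output, 0 to 4 in steps of 0.5.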
For the apply function, you can refer to here.
My confusion comes mostly from this sample; I have added some print statements to the code snippet below to output more debugging information:
grd = GradientBoostingClassifier(n_estimators=n_estimator)
grd_enc = OneHotEncoder()
grd_lm = LogisticRegression()
grd.fit(X_train, y_train)
test_var = grd.apply(X_train)[:, :, 0]
print("test_var.shape", test_var.shape)
print("test_var", test_var)
grd_enc.fit(grd.apply(X_train)[:, :, 0])
grd_lm.fit(grd_enc.transform(grd.apply(X_train_lr)[:, :, 0]), y_train_lr)
The output is shown below. I am confused about what numbers like 6., 3. and 10. mean, and how they are related to the final classification result.
test_var.shape (20000, 10)
test_var [[ 6. 6. 6. ..., 10. 10. 10.]
[ 10. 10. 10. ..., 3. 3. 3.]
[ 6. 6. 6. ..., 11. 10. 10.]
...,
[ 6. 6. 6. ..., 10. 10. 10.]
[ 6. 6. 6. ..., 11. 10. 10.]
[ 6. 6. 6. ..., 11. 10. 10.]]
To understand gradient boosting, you first need to understand individual trees. I will show a small example.
Here is the setup: a small GB model trained on the Iris dataset to predict whether a flower belongs to class 2.
# import the most common dataset
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
X, y = load_iris(return_X_y=True)
# there are 150 observations and 4 features
print(X.shape) # (150, 4)
# let's build a small model = 5 trees with depth no more than 2
model = GradientBoostingClassifier(n_estimators=5, max_depth=2, learning_rate=1.0)
model.fit(X, y==2) # predict 2nd class vs rest, for simplicity
# we can access individual trees
trees = model.estimators_.ravel()
print(len(trees)) # 5
# there are 150 observations, each is encoded by 5 trees, each tree has 1 output
applied = model.apply(X)
print(applied.shape) # (150, 5, 1)
print(applied[0].T) # [[2. 2. 2. 5. 2.]] - a single row of the apply() result
print(X[0]) # [5.1 3.5 1.4 0.2] - the observation corresponding to that row
print(trees[0].apply(X[[0]])) # [2] - 2 is the result of applying tree 0 to the sample
print(trees[3].apply(X[[0]])) # [5] - 5 is the result of applying tree 3 to the sample
You can see that each number in the sequence [2. 2. 2. 5. 2.] produced by model.apply() corresponds to an output of a single tree. But what do these numbers mean?
We can easily analyse decision trees by visual examination. Here is a function to plot one
# a function to draw a tree. You need pydotplus and graphviz installed
# sudo apt-get install graphviz
# pip install pydotplus
from io import StringIO  # sklearn.externals.six has been removed from recent scikit-learn releases
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus
def plot_tree(clf):
    dot_data = StringIO()
    export_graphviz(clf, out_file=dot_data, node_ids=True,
                    filled=True, rounded=True,
                    special_characters=True)
    graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
    return Image(graph.create_png())
# now we can plot the first tree
plot_tree(trees[0])
You can see that each node has a number (from 0 to 6). If we push our single example through this tree, it first goes to node #1 (because feature x3 has value 0.2 < 1.75), and then to node #2 (because feature x2 has value 1.4 < 4.95).
In the same way, we can analyze tree 3, which produced the output 5:
plot_tree(trees[3])
Here our observation goes first to node #4 and then to node #5, because x1=3.5>2.25 and x2=1.4<4.85. Thus, it ends up with number 5.
It's that simple! Each number produced by apply() is the ordinal number of the node of the corresponding tree in which the sample ends up.
These numbers relate to the final classification result through the values of the leaves in the corresponding trees. In the case of binary classification, the leaf values simply add up; if the sum is positive, the 'positive' class wins, otherwise the 'negative' class does. In the case of multiclass classification, the values add up for each class, and the class with the largest total value wins.
In our case, the first tree (with its node #2) gives the value -1.454, the other trees also give some values, and their total is -4.84. It is negative; thus, our example does not belong to class 2.
values = [trees[i].tree_.value[int(leaf)][0,0] for i, leaf in enumerate(applied[0].ravel())]
print(values) # [-1.454, -1.05, -0.74, -1.016, -0.58] - the values of nodes [2,2,2,5,2] in the corresponding trees
print(sum(values)) # -4.84 - sum of these values is negative -> this is not class 2
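Coming back to the snippet in the question: since each number from apply() is just a leaf index, it is a categorical label rather than a quantity, which is why that code one-hot encodes the indices before passing them to the logistic regression. A minimal sketch of that step on the Iris setup used above (assuming a recent scikit-learn; random_state and max_iter are choices made here for reproducibility and convergence, not part of the original code):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

X, y = load_iris(return_X_y=True)
target = (y == 2)  # class 2 vs the rest, as in the example above

model = GradientBoostingClassifier(n_estimators=5, max_depth=2,
                                   learning_rate=1.0, random_state=0)
model.fit(X, target)

# One leaf index per (sample, tree); drop the trailing axis of size 1.
leaves = model.apply(X)[:, :, 0]

# Each tree becomes a block of 0/1 columns, one column per leaf seen in training.
encoder = OneHotEncoder(handle_unknown="ignore")
leaf_features = encoder.fit_transform(leaves)

# Exactly one column is active per tree, so every row sums to the number of trees.
active_per_row = np.asarray(leaf_features.sum(axis=1)).ravel()

# The linear model then learns one weight per leaf.
lr = LogisticRegression(max_iter=1000).fit(leaf_features, target)
print(leaves.shape, leaf_features.shape)
```

In the question's code the trees and the logistic regression are fitted on separate subsets (X_train and X_train_lr); the sketch reuses the same data for both only to stay short.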
I’m trying to generate simulated student grades in 4 subjects, where a student record is a single row of data. The code shown here will generate normally distributed random numbers with a mean of 60 and a standard deviation of 15.
df = pd.DataFrame(15 * np.random.randn(5, 4) + 60, columns=['Math', 'Science', 'History', 'Art'])
What I can’t figure out is how to make it so that a student’s Science mark is highly correlated to their Math mark, and that their History and Art marks are less so, but still somewhat correlated to the Math mark.
I’m neither a statistician nor an expert programmer, so a less sophisticated but more easily understood solution is what I’m hoping for.
Let's put what has been suggested by @Daniel into code.
Step 1
Let's import multivariate_normal (plus NumPy and pandas, which are used below):
import numpy as np
import pandas as pd
from scipy.stats import multivariate_normal as mvn
Step 2
Let's construct the covariance matrix and generate data:
cov = np.array([[1., .8, .7, .6], [.8, 1., .5, .5], [.7, .5, 1., .5], [.6, .5, .5, 1.]])
cov
array([[ 1. , 0.8, 0.7, 0.6],
[ 0.8, 1. , 0.5, 0.5],
[ 0.7, 0.5, 1. , 0.5],
[ 0.6, 0.5, 0.5, 1. ]])
This is the key step. Note that the covariance matrix has 1's on the diagonal, and that the covariances with Math (the first row) decrease as you move from left to right.
Now we are ready to generate data, let's say 1,000 points:
scores = mvn.rvs(mean = [60.,60.,60.,60.], cov=cov, size = 1000)
Sanity check (from covariance matrix to simple correlations):
np.corrcoef(scores.T)
array([[ 1. , 0.78886583, 0.70198586, 0.56810058],
[ 0.78886583, 1. , 0.49187904, 0.45994833],
[ 0.70198586, 0.49187904, 1. , 0.4755558 ],
[ 0.56810058, 0.45994833, 0.4755558 , 1. ]])
Note that np.corrcoef expects each variable in its own row, hence the transpose.
Finally, let's put the data into a pandas DataFrame:
df = pd.DataFrame(data = scores, columns = ["Math", "Science","History", "Art"])
df.head()
Math Science History Art
0 60.629673 61.238697 61.805788 61.848049
1 59.728172 60.095608 61.139197 61.610891
2 61.205913 60.812307 60.822623 59.497453
3 60.581532 62.163044 59.277956 60.992206
4 61.408262 59.894078 61.154003 61.730079
Step 3
Let's visualize some data that we've just generated:
ax = df.plot(x = "Math",y="Art", kind="scatter", color = "r", alpha = .5, label = "Art, $corr_{Math}$ = .6")
df.plot(x = "Math",y="Science", kind="scatter", ax = ax, color = "b", alpha = .2, label = "Science, $corr_{Math}$ = .8")
ax.set_ylabel("Art and Science");
The statistical tool for that is the covariance matrix: https://en.wikipedia.org/wiki/Covariance.
Each cell (i, j) represents the dependency between variable i and variable j, so in your case it could be between Math and Science. If there is no dependency, the value is 0.
What you did amounts to assuming that the covariance is a diagonal matrix with the same value everywhere on the diagonal. So you have to define your covariance matrix and then draw the samples from a Gaussian with numpy.random.multivariate_normal https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.multivariate_normal.html or any other distribution function.
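For just two subjects there is an even simpler way to see what the covariance matrix is doing: build the second mark as a weighted mix of the first mark and fresh independent noise. The sketch below assumes standard-normal building blocks and an illustrative correlation of 0.8; the variable names are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_students = 10_000
rho = 0.8  # desired Math/Science correlation (illustrative value)

# Independent standard-normal draws.
math_z = rng.standard_normal(n_students)
noise = rng.standard_normal(n_students)

# Mixing math_z with independent noise gives a variable with unit variance
# whose correlation with math_z is rho.
science_z = rho * math_z + np.sqrt(1.0 - rho**2) * noise

# Shift and scale to marks with mean 60 and standard deviation 15.
math_marks = 60 + 15 * math_z
science_marks = 60 + 15 * science_z

observed = np.corrcoef(math_marks, science_marks)[0, 1]
print(round(observed, 2))
```

A smaller rho (say 0.3 for History or Art) gives a weaker link to the Math mark. With more than two subjects that must also correlate with each other, the multivariate normal approach is the more convenient route.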
Thank you guys for the responses; they were extremely useful. I adapted the code provided by Sergey to produce the result I was looking for, which was records with Math and Science marks that are relatively close most of the time, and History and Art marks that are more independent.
The following produced data that looks reasonable:
cov = np.array([[1, 0.5,.2, .1],[.5,1.,.1,.1],[0.2,.1,1,.3],[0.1,.1,.3,1]])
scores = mvn.rvs(mean = [0.,0.,0.,0.], cov=cov, size = 100)
df = pd.DataFrame(data = 15 * scores + 60, columns = ["Math","Science","History", "Art"])
df.head(10)
The next step would be to make it so that each subject has a different mean, but I have an idea of how to do that. Thanks again.
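One possible way to give each subject its own mean (and spread) is to scale the correlation matrix into a covariance matrix and pass the means directly. This is a sketch rather than the poster's eventual solution: the means and standard deviations below are made-up values, and it uses NumPy's Generator.multivariate_normal instead of scipy's mvn.

```python
import numpy as np
import pandas as pd

subjects = ["Math", "Science", "History", "Art"]

# Correlation matrix from the post above (unit diagonal).
corr = np.array([[1.0, 0.5, 0.2, 0.1],
                 [0.5, 1.0, 0.1, 0.1],
                 [0.2, 0.1, 1.0, 0.3],
                 [0.1, 0.1, 0.3, 1.0]])

# Hypothetical per-subject means and standard deviations.
means = np.array([60.0, 65.0, 70.0, 75.0])
stds = np.array([15.0, 12.0, 10.0, 18.0])

# cov[i, j] = corr[i, j] * stds[i] * stds[j]
cov = corr * np.outer(stds, stds)

rng = np.random.default_rng(0)
scores = rng.multivariate_normal(means, cov, size=1000)
df = pd.DataFrame(scores, columns=subjects)
print(df.describe().loc[["mean", "std"]])
```

Because the marks are drawn from an unbounded normal distribution, some values can fall below 0 or above 100; clipping them (for example with df.clip(0, 100)) is one option if that matters.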
Given the following pandas data frame with 60 elements.
import pandas as pd
data = [60,62.75,73.28,75.77,70.28
,67.85,74.58,72.91,68.33,78.59
,75.58,78.93,74.61,85.3,84.63
,84.61,87.76,95.02,98.83,92.44
,84.8,89.51,90.25,93.82,86.64
,77.84,76.06,77.75,72.13,80.2
,79.05,76.11,80.28,76.38,73.3
,72.28,77,69.28,71.31,79.25
,75.11,73.16,78.91,84.78,85.17
,91.53,94.85,87.79,97.92,92.88
,91.92,88.32,81.49,88.67,91.46
,91.71,82.17,93.05,103.98,105]
data_pd = pd.DataFrame(data, columns=["price"])
Is there a formula to rescale this so that each window longer than 20 elements, starting at index 0 and ending at index i+1, is rescaled down to 20 elements?
Here is a loop that creates the windows to be rescaled; I just do not know how to do the rescaling itself. Any suggestions on how this might be done?
miniLenght = 20
rescaledData = []
for i in range(len(data_pd)):
    if(i >= miniLenght):
        dataForScaling = data_pd[0:i]
        scaledDataToMinLenght = dataForScaling #do the scaling here so that the length of the rescaled data is always equal to miniLenght
        rescaledData.append(scaledDataToMinLenght)
Basically after the rescaling the rescaledData should have 40 arrays, each with a length of 20 prices.
From reading the paper, it looks like you are resizing the list back to 20 indices, then interpolating the data at your 20 indices.
We'll make the indices like they do (range(0, len(large), step = len(large)/miniLenght)), then use NumPy's interp; there are a million ways of interpolating data. np.interp uses linear interpolation, so if you ask for, e.g., index 1.5, you get the mean of points 1 and 2, and so on.
So, here's a quick modification of your code to do it (NB: we could probably fully vectorize this using 'rolling'):
import numpy as np
miniLenght = 20
rescaledData = []
for i in range(len(data_pd)):
    if(i >= miniLenght):
        dataForScaling = data_pd['price'][0:i]
        #figure out how many 'steps' we have
        steps = len(dataForScaling)
        #make indices where the data needs to be sliced to get 20 points
        indices = np.arange(0,steps, step = steps/miniLenght)
        #use np.interp at those points, with the original values as given
        rescaledData.append(np.interp(indices, np.arange(steps), dataForScaling))
And the output is as expected:
[array([ 60. , 62.75, 73.28, 75.77, 70.28, 67.85, 74.58, 72.91,
68.33, 78.59, 75.58, 78.93, 74.61, 85.3 , 84.63, 84.61,
87.76, 95.02, 98.83, 92.44]),
array([ 60. , 63.2765, 73.529 , 74.9465, 69.794 , 69.5325,
74.079 , 71.307 , 72.434 , 77.2355, 77.255 , 76.554 ,
81.024 , 84.8645, 84.616 , 86.9725, 93.568 , 98.2585,
93.079 , 85.182 ]),.....
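One caveat with np.arange and a fractional step is that floating-point rounding can occasionally yield one element more or fewer than intended. A variant that should always return exactly 20 points uses np.linspace with endpoint=False, which produces the same positions as the arange call. The helper name rescale_to_length is made up for this sketch, and the prices are stand-in values rather than the data from the question.

```python
import numpy as np

def rescale_to_length(values, target_len=20):
    """Linearly interpolate `values` down to exactly `target_len` points."""
    values = np.asarray(values, dtype=float)
    n = len(values)
    # Same positions as np.arange(0, n, n / target_len), but with a guaranteed count.
    positions = np.linspace(0.0, n, num=target_len, endpoint=False)
    return np.interp(positions, np.arange(n), values)

# Stand-in data: 60 increasing "prices".
prices = np.arange(60, dtype=float)

# One rescaled window per i from 20 to 59, mirroring the loop above.
windows = [rescale_to_length(prices[:i]) for i in range(20, len(prices))]
print(len(windows), {len(w) for w in windows})
```

As in the loop above, the last sampled position falls slightly before the final element of each window, so the most recent price is generally not reproduced exactly; np.linspace(0, n - 1, target_len) would pin both endpoints instead.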