How can I resize time-series data in Python - python

I'm trying to do deep learning on time-series data.
Each sample has 12 features, but the series don't all have the same length.
Some have shape [48, 12] and some [54, 12]; I'm trying to resize them all to [50, 12].
All I know so far is resize in skimage.transform, but I don't know whether it works well for this.
Is there any other way to do this?
For example, one of the features in the data looks like the array below.
Its shape is [55, 1] and I would like to resize it to [50, 1].
a = np.array([-5.529309, -4.6293, -3.068647, -4.897388, -4.39951, -4.753769, -3.729291,
-4.973984, -5.060155, -4.686748, -4.696322, -3.939932, -3.470778, -6.209103,
-5.586756, -4.466532, -3.193116, -5.337818, -5.596331, -4.006954, -3.499502,
-3.413331, -6.304848, -4.322914, -4.246317, -5.759098, -5.893142, -6.381444,
-4.52398, -4.198445, -5.634629, -6.276124, -5.17505, -4.322914, -4.198445,
-4.600576, -4.39951, -4.945261, -5.759098, -4.677173, -3.623971, -5.692076,
-6.563361, -5.462287, -4.868664, -5.941015, -6.400594, -5.692076, -4.591002,
-6.027186, -5.960164, -6.256975, -5.414414, -5.730374, -6.726129])
If I use resize, the data look like this (figure not shown: the series before and after resize).

One option is TimeSeriesResampler in tslearn. It resizes a given time series to a fixed size that you specify, resampling the data by linear interpolation.
https://tslearn.readthedocs.io/en/stable/gen_modules/preprocessing/tslearn.preprocessing.TimeSeriesResampler.html
Example:
import numpy as np
from tslearn.preprocessing import TimeSeriesResampler
ts = np.arange(5)
new_ts = TimeSeriesResampler(sz=9).fit_transform(ts)
final_ts = np.squeeze(new_ts)
print(ts) # [0 1 2 3 4]
print(new_ts) # [[[0. ] [0.5] [1. ] [1.5] [2. ] [2.5] [3. ] [3.5] [4. ]]]
print(final_ts) # [0. 0.5 1. 1.5 2. 2.5 3. 3.5 4.]
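If adding a dependency is not desirable, the same kind of linear resampling can be sketched with plain NumPy. The helper name resample_series below is hypothetical, and the sketch assumes each series is a 2-D array of shape [length, n_features] whose features are interpolated independently along the time axis.

```python
import numpy as np

def resample_series(series, target_len=50):
    """Linearly resample a [length, n_features] array to [target_len, n_features]."""
    series = np.asarray(series, dtype=float)
    # Place old and new samples on a common 0..1 time axis.
    old_pos = np.linspace(0.0, 1.0, num=series.shape[0])
    new_pos = np.linspace(0.0, 1.0, num=target_len)
    # np.interp works on 1-D data, so interpolate one feature at a time.
    columns = [np.interp(new_pos, old_pos, series[:, j])
               for j in range(series.shape[1])]
    return np.stack(columns, axis=1)

# Toy data standing in for two series of different lengths.
rng = np.random.default_rng(0)
short_series = rng.normal(size=(48, 12))
long_series = rng.normal(size=(54, 12))
print(resample_series(short_series).shape)  # (50, 12)
print(resample_series(long_series).shape)   # (50, 12)
```

Whether linear interpolation is appropriate depends on the data; for very noisy series, smoothing before downsampling may be worth considering.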

Related

Labels obtained from clustering seem visually incorrect

I have the following distance matrix based on 10 datapoints:
import numpy as np
distance_matrix = np.array([[0. , 0.00981376, 0.0698306 , 0.01313118, 0.05344448,
0.0085152 , 0.01996724, 0.14019663, 0.03702411, 0.07054652],
[0.00981376, 0. , 0.06148157, 0.00563764, 0.04473798,
0.00905327, 0.01223233, 0.13140022, 0.03114453, 0.06215728],
[0.0698306 , 0.06148157, 0. , 0.05693448, 0.02083512,
0.06390897, 0.05107812, 0.07539802, 0.04003773, 0.00703263],
[0.01313118, 0.00563764, 0.05693448, 0. , 0.0408836 ,
0.00787845, 0.00799949, 0.12779965, 0.02552774, 0.05766039],
[0.05344448, 0.04473798, 0.02083512, 0.0408836 , 0. ,
0.04846382, 0.03638932, 0.0869414 , 0.03579818, 0.0192329 ],
[0.0085152 , 0.00905327, 0.06390897, 0.00787845, 0.04846382,
0. , 0.01284173, 0.13540522, 0.03010677, 0.0646998 ],
[0.01996724, 0.01223233, 0.05107812, 0.00799949, 0.03638932,
0.01284173, 0. , 0.12310601, 0.01916205, 0.05188323],
[0.14019663, 0.13140022, 0.07539802, 0.12779965, 0.0869414 ,
0.13540522, 0.12310601, 0. , 0.11271352, 0.07346808],
[0.03702411, 0.03114453, 0.04003773, 0.02552774, 0.03579818,
0.03010677, 0.01916205, 0.11271352, 0. , 0.04157886],
[0.07054652, 0.06215728, 0.00703263, 0.05766039, 0.0192329 ,
0.0646998 , 0.05188323, 0.07346808, 0.04157886, 0. ]])
I transform the distance_matrix to an affinity_matrix by using the following
delta = 0.1
np.exp(- distance_matrix ** 2 / (2. * delta ** 2))
Which gives
affinity_matrix = np.array([[1. , 0.99519608, 0.7836321 , 0.99141566, 0.86691389,
0.99638113, 0.98026285, 0.37427863, 0.93375682, 0.77970427],
[0.99519608, 1. , 0.82778719, 0.99841211, 0.90477015,
0.9959103 , 0.99254642, 0.42176757, 0.95265821, 0.82433657],
[0.7836321 , 0.82778719, 1. , 0.85037594, 0.97852875,
0.81528476, 0.8777015 , 0.75258369, 0.92297697, 0.99753016],
[0.99141566, 0.99841211, 0.85037594, 1. , 0.91982353,
0.99690131, 0.99680552, 0.44191509, 0.96794184, 0.84684633],
[0.86691389, 0.90477015, 0.97852875, 0.91982353, 1. ,
0.88919645, 0.93593511, 0.68527137, 0.9379342 , 0.98167476],
[0.99638113, 0.9959103 , 0.81528476, 0.99690131, 0.88919645,
1. , 0.9917884 , 0.39982486, 0.95569077, 0.81114925],
[0.98026285, 0.99254642, 0.8777015 , 0.99680552, 0.93593511,
0.9917884 , 1. , 0.46871776, 0.9818083 , 0.87407117],
[0.37427863, 0.42176757, 0.75258369, 0.44191509, 0.68527137,
0.39982486, 0.46871776, 1. , 0.52982057, 0.76347268],
[0.93375682, 0.95265821, 0.92297697, 0.96794184, 0.9379342 ,
0.95569077, 0.9818083 , 0.52982057, 1. , 0.91719051],
[0.77970427, 0.82433657, 0.99753016, 0.84684633, 0.98167476,
0.81114925, 0.87407117, 0.76347268, 0.91719051, 1. ]])
I transform the distance_matrix into a heatmap to get a better visual of the data
import pandas as pd
import seaborn as sns
distance_matrix_df = pd.DataFrame(distance_matrix)
distance_matrix_df.columns = [x + 1 for x in range(10)]
distance_matrix_df.index = [x + 1 for x in range(10)]
sns.heatmap(distance_matrix_df, cmap='RdYlGn_r', annot=True, linewidths=0.5)
Next I want to cluster the affinity_matrix in 3 clusters. Before running the actual clustering, I inspect the heatmap to forecast the clusters. Clearly #8 is an outlier and will be a cluster on its own.
Next I run the actual clustering.
from sklearn.cluster import SpectralClustering
clustering = SpectralClustering(n_clusters=3,
assign_labels='kmeans',
affinity='precomputed').fit(affinity_matrix)
clusters = clustering.labels_.copy()
clusters = clusters.astype(np.int32) + 1
The output is
[1, 1, 2, 1, 2, 1, 1, 2, 3, 2]
So, #8 is part of cluster 2, which contains three other data points. I had assumed it would be a cluster on its own. Did I do something wrong? Or can someone show me why #8 looks like #3, #5 and #10? Please advise.
When we move away from relatively simple clustering algorithms such as k-means, whatever intuition we may carry along regarding an algorithm's results and expected behavior breaks down; indeed, the scikit-learn documentation on spectral clustering gives an implicit warning about that:
Apply clustering to a projection of the normalized Laplacian.
In practice Spectral Clustering is very useful when the structure of
the individual clusters is highly non-convex or more generally when a
measure of the center and spread of the cluster is not a suitable
description of the complete cluster. For instance when clusters are
nested circles on the 2D plane.
Now, even if one pretends to understand exactly what "a projection of the normalized Laplacian" means (I won't), the rest of the description arguably makes it clear enough that we should not expect results similar to those of more intuitive, distance-based clustering algorithms like k-means.
Nevertheless, your own intuition is not unfounded, and it shows if you just try k-means clustering instead of spectral clustering; using your exact data, we get
from sklearn.cluster import KMeans
clustering = KMeans(n_clusters=3, random_state=42).fit(affinity_matrix)
clusters = clustering.labels_.copy()
clusters = clusters.astype(np.int32) + 1
clusters
# result:
array([2, 2, 1, 2, 1, 2, 2, 3, 2, 1], dtype=int32)
where indeed sample #8 stands out as an outlier in a cluster of its own (cluster 3).
Still, the same intuition is not necessarily applicable or useful with other clustering algorithms, whose value is arguably exactly that they can uncover regularities of different kinds in the data; they would not be that useful if they just replicated results from existing algorithms like k-means, would they?
The scikit-learn example "Comparing different clustering algorithms on toy datasets" might be useful to get an idea of how different clustering algorithms behave on some toy 2D datasets; its summary figure (not reproduced here) shows the results side by side.

I can't seem to grasp how to use a radial basis function kernel for a classification task in python

I'm tasked with using Parzen windows with the radial basis function kernel to determine which label to give to a given point.
My training data set has 4 dimensions (4 features per point).
My training label set contains the labels (which can be 0,1,2,... depending on how many classes we have) for all the points in my training set (It's a 1D-array).
My test data set contains a couple of points with 4 dimensions but no labels so it's a nx4 array.
We're interested in giving labels for each of the points in my test data set.
I first compute the RBF kernel $k(x_i,x)$ (using Python and NumPy):
for (i, ex) in enumerate(test_data):
    squared_distances = (np.sum((np.abs(ex - self.train_inputs)) ** 2, axis=1)) ** (1.0 / 2)
    k = np.exp(- squared_distances/2*(np.square(self.sigma)))
Let's assume that test_data looks like this :
[[ 0.40614 1.3492 -1.4501 -0.55949]
[ -1.3887 -4.8773 6.4774 0.34179]
[ -3.7503 -13.4586 17.5932 -2.7771 ]
[ -3.5637 -8.3827 12.393 -1.2823 ]
[ -2.5419 -0.65804 2.6842 1.1952 ]]
ex is a point from the test data set. here as an example :
[ 0.40614 1.3492 -1.4501 -0.55949]
self.train_inputs is the training data set and it looks like this
[[ 3.6216 8.6661 -2.8073 -0.44699]
[ 4.5459 8.1674 -2.4586 -1.4621 ]
[ 3.866 -2.6383 1.9242 0.10645]
...
[-1.1667 -1.4237 2.9241 0.66119]
[-2.8391 -6.63 10.4849 -0.42113]
[-4.5046 -5.8126 10.8867 -0.52846]]
k is an array containing the kernel values between every x_i (in self.train_inputs) and our current test point x (which is ex in the code).
k = [0.99837982 0.9983832 0.99874063 ... 0.9988909 0.99706044 0.99698724]
It's of the same length as the number of points in self.train_inputs.
My understanding of the radial basis function is that the closer a training point is to the test point, the greater the value of k(current training point, test point). However, k can never exceed 1 or be below 0.
So the goal is to select the training point that is the closest to the test point. We do this by looking which has the greatest value in k. Then we take its index and use that same index on the array containing the labels only. Therefore we get the label we want our test point to take.
In code it translates to this (the additional code is put below the first code snippet above) :
    best_arg = np.argmax(k) #selects the greatest value in k and gives back its index.
    classes_pred[i] = self.train_labels[best_arg] #we use the index to select the label in the train labels array.
Here self.train_labels looks like :
[0. 0. 0. ... 1. 1. 1.]
This approach gives for ex = [ 0.40614 1.3492 -1.4501 -0.55949] and k = [0.99837982 0.9983832 0.99874063 ... 0.9988909 0.99706044 0.99698724] :
the index of the greatest value in the current k is 818, and the predicted label is 1. because self.train_labels[818] = 1.
However, it seems that I'm doing this wrong. Compared with an algorithm already implemented by my teacher, I get some of the labels wrong (especially when we have more than two classes). My question is: am I doing this wrong? If yes, where? I'm new to machine learning, by the way.
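For comparison, here is a minimal sketch of how a Parzen-window classifier with an RBF kernel is commonly formulated; it is not the teacher's implementation. It assumes the kernel k(x_i, x) = exp(-||x_i - x||^2 / (2 sigma^2)) and that each test point receives the class whose training points have the largest total kernel weight, rather than the label of the single closest point. The function name parzen_rbf_predict and the toy data are hypothetical.

```python
import numpy as np

def parzen_rbf_predict(train_inputs, train_labels, test_data, sigma):
    """Assign each test point the class with the largest summed RBF weight."""
    train_inputs = np.asarray(train_inputs, dtype=float)
    train_labels = np.asarray(train_labels).astype(int)
    test_data = np.asarray(test_data, dtype=float)
    n_classes = train_labels.max() + 1
    classes_pred = np.empty(len(test_data), dtype=int)
    for i, ex in enumerate(test_data):
        # Squared Euclidean distance to every training point (no square root).
        squared_distances = np.sum((train_inputs - ex) ** 2, axis=1)
        # Note the parentheses: the whole of 2 * sigma**2 is the denominator.
        k = np.exp(-squared_distances / (2.0 * sigma ** 2))
        # Sum the kernel weights per class and pick the heaviest class.
        votes = np.bincount(train_labels, weights=k, minlength=n_classes)
        classes_pred[i] = np.argmax(votes)
    return classes_pred

# Toy data: three class-0 points near the origin, one class-1 point at (1, 1).
train_x = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [1.0, 1.0]])
train_y = np.array([0, 0, 0, 1])
test_x = np.array([[0.6, 0.6]])
wide = parzen_rbf_predict(train_x, train_y, test_x, sigma=1.0)
narrow = parzen_rbf_predict(train_x, train_y, test_x, sigma=0.05)
print(wide, narrow)  # [0] [1]
```

Compared with the snippet in the question, the sketch keeps the squared distance, divides by 2 * sigma**2 as a whole (in Python, -d/2*(s**2) evaluates as (-d/2)*s**2), and lets all training points vote. With a wide kernel the three class-0 points outweigh the single nearest class-1 point, which is where this differs from taking np.argmax(k).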

Paneling barplot from pivot table

I have a table like this:
ID var1. var2. var3. var4. var5. var6. var7 var8 var9 ... var22 ...
A. 1. 1. 7. 0. 0.6. 0. 7. 2. 2,4. ....
B 9. 1. 7. 0. 0.6. 0. 7. 2. 2,4. ....
C 0. 1. 0. 8. 0.5. 5. 7. 2.9. 2,8. ....
And I want to build a bar plot for each ID and bind them all on a panel, my idea of panel is like here.
So, 'x' will be the variables (that are columns names) and the 'y' the values that are the values of the columns in this data frame.
One important thing is that I don't want to show on the graph the variables that have zero value for a given ID, so, for example: for ID 'A', var4 and var6 won't be on the graph of ID 'A', but they will be on the graph of ID 'C', for example.
So far I have:
The transposition of the columns:
df_melted = res.melt(id_vars='ID')
Then I remove all the zeros:
df_melted_no_zeros = df_melted[df_melted.value != 0]
Then as I could not manage to build the panel, I filter by one ID:
ID_A = df_melted_no_zeros[(df_melted_no_zeros.ID == "A")]
Then, on the plot, there are so many variables that I can't find a way to make it readable, since there are so many names (about 20 on the x axis of each graph). It would be enough for me to show just the legend for the top 5 values, but I couldn't work out how to do it.
Also, my variables are mostly between 0.004 and 0.009, but there are always two variables with a value of 4 or 5, so the rest are practically invisible on the plot.
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
variables = ID_A['variable']
values = ID_A['value']
ax.bar(variables,values)
plt.show()
With this subset of your DataFrame:
ID var1. var2. var3. var4. var5. var6. var7 var8 var9
A. 1. 1. 7. 0. 0.6 0. 7. 2. 2.4
B 9. 1. 7. 0. 0.6 0. 7. 2. 2.4
C 0. 1. 0. 8. 0. 5. 7. 2.9 2.8
From wide to long:
df_melted = df.melt(id_vars='ID')
Get just top several:
df_top5 = df_melted[df_melted['value'].isin(
df_melted.groupby('ID')['value'].nlargest(5).unique())]
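Note that the isin filter above pools the top values across all IDs, so a value that is in one ID's top five is kept for every ID in which it occurs. If the top five should instead be chosen separately for each ID (and zeros dropped, as the question asks), one possible sketch sorts by value and keeps the first rows of each group. The small frame below just re-types the subset shown above.

```python
import pandas as pd

# The subset of the table shown above, typed in by hand.
df = pd.DataFrame({
    "ID": ["A", "B", "C"],
    "var1": [1.0, 9.0, 0.0],
    "var2": [1.0, 1.0, 1.0],
    "var3": [7.0, 7.0, 0.0],
    "var4": [0.0, 0.0, 8.0],
    "var5": [0.6, 0.6, 0.0],
    "var6": [0.0, 0.0, 5.0],
    "var7": [7.0, 7.0, 7.0],
    "var8": [2.0, 2.0, 2.9],
    "var9": [2.4, 2.4, 2.8],
})

# Wide to long, then drop the zero entries.
df_melted = df.melt(id_vars="ID")
df_nonzero = df_melted[df_melted["value"] != 0]

# Sort by value and keep the five largest rows within each ID.
df_top5 = (df_nonzero.sort_values("value", ascending=False)
           .groupby("ID")
           .head(5))
print(df_top5.sort_values(["ID", "value"], ascending=[True, False]))
```

Ties (such as var1 and var2 for ID A) are broken by sort order, so which of the tied variables is kept is not something this sketch controls.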
Plot with seaborn:
import matplotlib.pyplot as plt
import seaborn as sns
g = sns.FacetGrid(df_top5,col="ID",sharex=False,sharey=False,col_wrap=2)
g.map(sns.barplot,'variable','value',order=None,hue=df_top5['variable'],
dodge=False,palette='deep')
plt.show()
Result:

confused by apply function of GradientBoostingClassifier

For the apply function, refer to the scikit-learn documentation.
My confusion comes from a scikit-learn sample; I have added some print statements to the code snippet below to output more debug information:
grd = GradientBoostingClassifier(n_estimators=n_estimator)
grd_enc = OneHotEncoder()
grd_lm = LogisticRegression()
grd.fit(X_train, y_train)
test_var = grd.apply(X_train)[:, :, 0]
print("test_var.shape", test_var.shape)
print("test_var", test_var)
grd_enc.fit(grd.apply(X_train)[:, :, 0])
grd_lm.fit(grd_enc.transform(grd.apply(X_train_lr)[:, :, 0]), y_train_lr)
The output is shown below. I'm confused about what numbers like 6., 3. and 10. mean, and how they relate to the final classification result.
test_var.shape (20000, 10)
test_var [[ 6. 6. 6. ..., 10. 10. 10.]
[ 10. 10. 10. ..., 3. 3. 3.]
[ 6. 6. 6. ..., 11. 10. 10.]
...,
[ 6. 6. 6. ..., 10. 10. 10.]
[ 6. 6. 6. ..., 11. 10. 10.]
[ 6. 6. 6. ..., 11. 10. 10.]]
To understand gradient boosting, you need first to understand individual trees. I will show a small example.
Here is the setup: a small GB model trained on the Iris dataset to predict whether a flower belongs to class 2.
# import the most common dataset
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
X, y = load_iris(return_X_y=True)
# there are 150 observations and 4 features
print(X.shape) # (150, 4)
# let's build a small model = 5 trees with depth no more than 2
model = GradientBoostingClassifier(n_estimators=5, max_depth=2, learning_rate=1.0)
model.fit(X, y==2) # predict 2nd class vs rest, for simplicity
# we can access individual trees
trees = model.estimators_.ravel()
print(len(trees)) # 5
# there are 150 observations, each is encoded by 5 trees, each tree has 1 output
applied = model.apply(X)
print(applied.shape) # (150, 5, 1)
print(applied[0].T) # [[2. 2. 2. 5. 2.]] - a single row of the apply() result
print(X[0]) # [5.1 3.5 1.4 0.2] - the observation corresponding to that row
print(trees[0].apply(X[[0]])) # [2] - 2 is the result of applying tree 0 to the sample
print(trees[3].apply(X[[0]])) # [5] - 5 is the result of applying tree 3 to the sample
You can see that each number in the sequence [2. 2. 2. 5. 2.] produced by model.apply() corresponds to an output of a single tree. But what do these numbers mean?
We can easily analyse decision trees by visual examination. Here is a function to plot one
# a function to draw a tree. You need pydotplus and graphviz installed
# sudo apt-get install graphviz
# pip install pydotplus
from io import StringIO  # sklearn.externals.six was removed from recent scikit-learn releases
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus
def plot_tree(clf):
    dot_data = StringIO()
    export_graphviz(clf, out_file=dot_data, node_ids=True,
                    filled=True, rounded=True,
                    special_characters=True)
    graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
    return Image(graph.create_png())
# now we can plot the first tree
plot_tree(trees[0])
You can see that each node has a number (from 0 to 6). If we push our single example into this tree, it will first go to node #1 (because the feature x3 has value 0.2 < 1.75), and then to node #2 (because the feature x2 has value 1.4 < 4.95).
In the same way, we can analyze the tree 3 which has produced the output 5:
plot_tree(trees[3])
Here our observation goes first to node #4 and then to node #5, because x1=3.5>2.25 and x2=1.4<4.85. Thus, it ends up with number 5.
It's that simple! Each number produced by apply() is the ordinal number of the node of the corresponding tree in which the sample ends up.
The relation of these numbers to the final classification result is through the values of the leaves in the corresponding trees. In the case of binary classification, the values of the reached leaves are simply added up; if the sum is positive, the 'positive' class wins, otherwise the 'negative' class. In the case of multiclass classification, the values are added up for each class, and the class with the largest total value wins.
In our case, the first tree (with its node #2) gives value -1.454, the other trees also give some values, and total sum of them is -4.84. It is negative, thus, our example does not belong to class 2.
values = [trees[i].tree_.value[int(leaf)][0,0] for i, leaf in enumerate(applied[0].ravel())]
print(values) # [-1.454, -1.05, -0.74, -1.016, -0.58] - the values of nodes [2,2,2,5,2] in the corresponding trees
print(sum(values)) # -4.84 - sum of these values is negative -> this is not class 2
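As a cross-check of the summation described above, the following sketch rebuilds the same small model and compares the per-leaf sums with decision_function. It assumes a reasonably recent scikit-learn. As far as I can tell, the raw score also includes a constant initial prediction, and the leaf values are multiplied by the learning rate (1.0 here), so the sketch only checks that the difference between the two is the same constant for every sample. The random_state argument is added only for reproducibility.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_iris(return_X_y=True)
model = GradientBoostingClassifier(n_estimators=5, max_depth=2,
                                   learning_rate=1.0, random_state=0)
model.fit(X, y == 2)  # class 2 vs rest, as in the example above

trees = model.estimators_.ravel()
# Leaf index reached by every sample in every tree, shape (150, 5).
leaves = model.apply(X)[:, :, 0].astype(int)

# Sum the values of the reached leaves over all trees.
leaf_sums = np.zeros(X.shape[0])
for t, tree in enumerate(trees):
    leaf_sums += tree.tree_.value[leaves[:, t], 0, 0]

# Whatever remains should be one constant shared by all samples.
offset = model.decision_function(X) - model.learning_rate * leaf_sums
is_constant = np.allclose(offset, offset[0])
print(is_constant)
```

If the offset is indeed constant, the sign of the final score is decided by the leaf sum plus that constant, not by the leaf sum alone, although for clearly negative sums such as the one above the conclusion is the same.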

Rescale price list from a longer length to a smaller length

Given the following pandas data frame with 60 elements.
import pandas as pd
data = [60,62.75,73.28,75.77,70.28
,67.85,74.58,72.91,68.33,78.59
,75.58,78.93,74.61,85.3,84.63
,84.61,87.76,95.02,98.83,92.44
,84.8,89.51,90.25,93.82,86.64
,77.84,76.06,77.75,72.13,80.2
,79.05,76.11,80.28,76.38,73.3
,72.28,77,69.28,71.31,79.25
,75.11,73.16,78.91,84.78,85.17
,91.53,94.85,87.79,97.92,92.88
,91.92,88.32,81.49,88.67,91.46
,91.71,82.17,93.05,103.98,105]
data_pd = pd.DataFrame(data, columns=["price"])
Is there a formula to rescale this in such a way so that for each window bigger than 20 elements starting from index 0 to index i+1, the data is rescaled down to 20 elements?
Here is a loop that creates the windows of data to rescale; I just do not know how to do the rescaling itself. Any suggestions on how this might be done?
miniLenght = 20
rescaledData = []
for i in range(len(data_pd)):
    if(i >= miniLenght):
        dataForScaling = data_pd[0:i]
        scaledDataToMinLenght = dataForScaling #do the scaling here so that the length of the rescaled data is always equal to miniLenght
        rescaledData.append(scaledDataToMinLenght)
Basically after the rescaling the rescaledData should have 40 arrays, each with a length of 20 prices.
From reading the paper, it looks like you are resizing the list back to 20 indices, then interpolating the data at your 20 indices.
We'll make the indices like they do (range(0, len(large), step = len(large)/miniLenght)), then use NumPy's interp; there are many other ways of interpolating data. np.interp uses linear interpolation, so if you ask for, e.g., index 1.5, you get the mean of points 1 and 2, and so on.
So, here's a quick modification of your code to do it (nb, we could probably fully vectorize this using 'rolling'):
import numpy as np
miniLenght = 20
rescaledData = []
for i in range(len(data_pd)):
    if(i >= miniLenght):
        dataForScaling = data_pd['price'][0:i]
        #figure out how many 'steps' we have
        steps = len(dataForScaling)
        #make indices where the data needs to be sliced to get 20 points
        indices = np.arange(0,steps, step = steps/miniLenght)
        #use np.interp at those points, with the original values as given
        rescaledData.append(np.interp(indices, np.arange(steps), dataForScaling))
And the output is as expected:
[array([ 60. , 62.75, 73.28, 75.77, 70.28, 67.85, 74.58, 72.91,
68.33, 78.59, 75.58, 78.93, 74.61, 85.3 , 84.63, 84.61,
87.76, 95.02, 98.83, 92.44]),
array([ 60. , 63.2765, 73.529 , 74.9465, 69.794 , 69.5325,
74.079 , 71.307 , 72.434 , 77.2355, 77.255 , 76.554 ,
81.024 , 84.8645, 84.616 , 86.9725, 93.568 , 98.2585,
93.079 , 85.182 ]),.....
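One detail of the loop above: np.arange with a fractional step stops short of the last index, so the most recent price in each window is not sampled, and the NumPy documentation suggests np.linspace when the step is not an integer. If the first and last prices should both be kept, a possible variant (a sketch, not what the answer above does) looks like this; the helper name rescale_window is hypothetical and the prices are placeholder values.

```python
import numpy as np

def rescale_window(prices, target_len=20):
    """Linearly resample a 1-D price window to target_len points, keeping both ends."""
    prices = np.asarray(prices, dtype=float)
    # target_len evenly spaced positions from the first to the last index.
    positions = np.linspace(0, len(prices) - 1, num=target_len)
    return np.interp(positions, np.arange(len(prices)), prices)

# Placeholder prices 0..59 so the sampled positions are easy to read off.
prices = np.arange(60, dtype=float)
rescaled = [rescale_window(prices[:i]) for i in range(20, len(prices))]
print(len(rescaled), rescaled[0].shape)  # 40 (20,)
```

With real data, prices would be replaced by data_pd['price'].to_numpy().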
