running lasso and ridge regression on pandas dataframe - python

I have the following code which successfully runs an OLS regression on the supplied dataset:
y = df['SPXR_{}D'.format(window)]
x = df[cols]
x = sm.add_constant(x)
mod = sm.OLS(y, x)
res = mod.fit()
How would I run lasso and ridge instead? I can't seem to find any statsmodels function or package to do this.
Updated code using sklearn:
from sklearn import linear_model

y = df['SPXR_{}D'.format(window)]
x = df[cols]
x = sm.add_constant(x)
mod = linear_model.Lasso()
res = mod.fit(x, y)
print(res.coef_)
print(res.intercept_)
res.coef_ looks like this:
[ 0. 0. -0. 0. -0. -0. -0. 0. 0. -0. 0. 0. 0. -0. -0. 0. -0.]
Is there an issue in how I'm using the function? (Perhaps I shouldn't be using statsmodels to add the constant column to my DF?)

As sacul writes, it is better to use sklearn for these things. In this case,
from sklearn import linear_model
rgr = linear_model.Ridge().fit(x, y)
Note the following:
The fit_intercept=True parameter of Ridge (it is the default) alleviates the need to manually add the constant as you did.
Shameless plug: I wrote ibex, a library that aims to make sklearn work better with pandas.
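As for the all-zero coef_ in your Lasso attempt: that is usually not a usage bug but the default penalty alpha=1.0 shrinking every coefficient of unscaled data to zero. A minimal sketch (the alpha value here is an illustrative assumption, not a tuned choice):
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scale the features and use a weaker penalty than the default alpha=1.0;
# fit_intercept=True (the default) makes the manual constant unnecessary.
lasso = make_pipeline(StandardScaler(), Lasso(alpha=0.01))
lasso.fit(df[cols], y)
print(lasso.named_steps['lasso'].coef_)
print(lasso.named_steps['lasso'].intercept_)
Cross-validating alpha (e.g. with LassoCV) is usually a better idea than picking it by hand.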


Truncated SVD and PCA

Theoretically, the projection result of PCA and SVD is the same if the feature has mean 0. So I tried it in Python.
from sklearn import datasets
cancer = datasets.load_breast_cancer()
from sklearn.preprocessing import StandardScaler
# we can set our feature to have mean 0 by setting with_mean=False
scaler = StandardScaler(with_mean=False,with_std=False)
scaler.fit(cancer.data)
X_scaled = scaler.transform(cancer.data)
from sklearn.decomposition import PCA
pca=PCA(n_components=3,svd_solver='randomized')
pca.fit(X_scaled)
X_pca=pca.transform(X_scaled)
from sklearn.decomposition import TruncatedSVD
svdm=TruncatedSVD(n_components=3,algorithm='randomized')
svdm.fit(X_scaled)
X_svdm=svdm.transform(X_scaled)
But when I print the result, it is different. Why is this happening?
print(X_pca)
print(X_svdm)
>>>[[1160.1425737 -293.91754364 48.57839763]
[1269.12244319 15.63018184 -35.39453423]
[ 995.79388896 39.15674324 -1.70975298]
...
[ 314.50175618 47.55352518 -10.44240718]
[1124.85811531 34.12922497 -19.74208742]
[-771.52762188 -88.64310636 23.88903189]]
>>>[[2241.97427647 347.71556015 -27.53741942]
[2372.40840267 56.90166991 23.86316187]
[2101.8402797 11.94762737 30.41138602]
...
[1424.53280954 -55.0217124 -3.5794351 ]
[2231.65579282 19.99439854 3.31619182]
[ 331.69302638 -5.29733966 -39.12136435]]
What should I fix so I can get the same result from both algorithms?
From the help page for StandardScaler:
with_mean : bool, default=True
    If True, center the data before scaling. This does not work (and will raise an exception) when attempted on sparse matrices, because centering them entails building a dense matrix which in common use cases is likely to be too large to fit in memory.
For PCA and SVD to give the same output, you need to center the data (PCA centers internally, while TruncatedSVD does not); scaling as well is standard practice. See also this post for details. So if you do:
# which is also the default
scaler = StandardScaler(with_mean=True, with_std=True)
X_scaled = scaler.fit_transform(cancer.data)
pca=PCA(n_components=3,svd_solver='randomized')
pca.fit(X_scaled)
X_pca=pca.transform(X_scaled)
svdm=TruncatedSVD(n_components=3,algorithm='randomized')
svdm.fit(X_scaled)
X_svdm=svdm.transform(X_scaled)
X_pca
array([[ 9.19283683, 1.94858306, -1.12316567],
[ 2.3878018 , -3.76817175, -0.52929196],
[ 5.73389628, -1.0751738 , -0.55174751],
...,
[ 1.25617928, -1.90229671, 0.56273027],
[10.37479406, 1.67201011, -1.87702986],
[-5.4752433 , -0.6706368 , 1.49044385]])
X_svdm
array([[ 9.19283683, 1.94858307, -1.12316615],
[ 2.3878018 , -3.76817174, -0.52929266],
[ 5.73389628, -1.0751738 , -0.55174759],
...,
[ 1.25617928, -1.90229671, 0.56273052],
[10.37479406, 1.67201011, -1.87702935],
[-5.4752433 , -0.67063679, 1.49044309]])
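A quick sanity check (a sketch; singular vectors are only determined up to sign, so an occasional sign flip per component is possible, and the randomized solvers add tiny numerical noise):
import numpy as np

# Compare component magnitudes with a loose tolerance.
print(np.allclose(np.abs(X_pca), np.abs(X_svdm), atol=1e-3))  # expect: True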

Labels obtained from clustering seem visually incorrect

I have the following distance matrix based on 10 datapoints:
import numpy as np
distance_matrix = np.array([[0. , 0.00981376, 0.0698306 , 0.01313118, 0.05344448,
0.0085152 , 0.01996724, 0.14019663, 0.03702411, 0.07054652],
[0.00981376, 0. , 0.06148157, 0.00563764, 0.04473798,
0.00905327, 0.01223233, 0.13140022, 0.03114453, 0.06215728],
[0.0698306 , 0.06148157, 0. , 0.05693448, 0.02083512,
0.06390897, 0.05107812, 0.07539802, 0.04003773, 0.00703263],
[0.01313118, 0.00563764, 0.05693448, 0. , 0.0408836 ,
0.00787845, 0.00799949, 0.12779965, 0.02552774, 0.05766039],
[0.05344448, 0.04473798, 0.02083512, 0.0408836 , 0. ,
0.04846382, 0.03638932, 0.0869414 , 0.03579818, 0.0192329 ],
[0.0085152 , 0.00905327, 0.06390897, 0.00787845, 0.04846382,
0. , 0.01284173, 0.13540522, 0.03010677, 0.0646998 ],
[0.01996724, 0.01223233, 0.05107812, 0.00799949, 0.03638932,
0.01284173, 0. , 0.12310601, 0.01916205, 0.05188323],
[0.14019663, 0.13140022, 0.07539802, 0.12779965, 0.0869414 ,
0.13540522, 0.12310601, 0. , 0.11271352, 0.07346808],
[0.03702411, 0.03114453, 0.04003773, 0.02552774, 0.03579818,
0.03010677, 0.01916205, 0.11271352, 0. , 0.04157886],
[0.07054652, 0.06215728, 0.00703263, 0.05766039, 0.0192329 ,
0.0646998 , 0.05188323, 0.07346808, 0.04157886, 0. ]])
I transform the distance_matrix to an affinity_matrix using the following:
delta = 0.1
affinity_matrix = np.exp(- distance_matrix ** 2 / (2. * delta ** 2))
Which gives
affinity_matrix = np.array([[1. , 0.99519608, 0.7836321 , 0.99141566, 0.86691389,
0.99638113, 0.98026285, 0.37427863, 0.93375682, 0.77970427],
[0.99519608, 1. , 0.82778719, 0.99841211, 0.90477015,
0.9959103 , 0.99254642, 0.42176757, 0.95265821, 0.82433657],
[0.7836321 , 0.82778719, 1. , 0.85037594, 0.97852875,
0.81528476, 0.8777015 , 0.75258369, 0.92297697, 0.99753016],
[0.99141566, 0.99841211, 0.85037594, 1. , 0.91982353,
0.99690131, 0.99680552, 0.44191509, 0.96794184, 0.84684633],
[0.86691389, 0.90477015, 0.97852875, 0.91982353, 1. ,
0.88919645, 0.93593511, 0.68527137, 0.9379342 , 0.98167476],
[0.99638113, 0.9959103 , 0.81528476, 0.99690131, 0.88919645,
1. , 0.9917884 , 0.39982486, 0.95569077, 0.81114925],
[0.98026285, 0.99254642, 0.8777015 , 0.99680552, 0.93593511,
0.9917884 , 1. , 0.46871776, 0.9818083 , 0.87407117],
[0.37427863, 0.42176757, 0.75258369, 0.44191509, 0.68527137,
0.39982486, 0.46871776, 1. , 0.52982057, 0.76347268],
[0.93375682, 0.95265821, 0.92297697, 0.96794184, 0.9379342 ,
0.95569077, 0.9818083 , 0.52982057, 1. , 0.91719051],
[0.77970427, 0.82433657, 0.99753016, 0.84684633, 0.98167476,
0.81114925, 0.87407117, 0.76347268, 0.91719051, 1. ]])
I transform the distance_matrix into a heatmap to get a better visual of the data
import pandas as pd
import seaborn as sns
distance_matrix_df = pd.DataFrame(distance_matrix)
distance_matrix_df.columns = [x + 1 for x in range(10)]
distance_matrix_df.index = [x + 1 for x in range(10)]
sns.heatmap(distance_matrix_df, cmap='RdYlGn_r', annot=True, linewidths=0.5)
Next I want to cluster the affinity_matrix in 3 clusters. Before running the actual clustering, I inspect the heatmap to forecast the clusters. Clearly #8 is an outlier and will be a cluster on its own.
Next I run the actual clustering.
from sklearn.cluster import SpectralClustering
clustering = SpectralClustering(n_clusters=3,
                                assign_labels='kmeans',
                                affinity='precomputed').fit(affinity_matrix)
clusters = clustering.labels_.copy()
clusters = clusters.astype(np.int32) + 1
The output is
[1, 1, 2, 1, 2, 1, 1, 2, 3, 2]
So, #8 is part of cluster 2, which consists of three other data points. Initially, I assumed it would be a cluster of its own. Did I do something wrong? Or can someone show me why #8 looks like #3, #5 and #10? Please advise.
When we move away from relatively simple clustering algorithms like k-means, whatever intuition we carry along about algorithm results and expected behavior breaks down; indeed, the scikit-learn documentation on spectral clustering gives an implicit warning about that:
Apply clustering to a projection of the normalized Laplacian.
In practice Spectral Clustering is very useful when the structure of
the individual clusters is highly non-convex or more generally when a
measure of the center and spread of the cluster is not a suitable
description of the complete cluster. For instance when clusters are
nested circles on the 2D plane.
Now, even if one pretends to understand exactly what "a projection of the normalized Laplacian" means (I won't), the rest of the description arguably makes clear enough that here we should not expect results similar to those of more intuitive, distance-based clustering algorithms like k-means.
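To see the kind of structure the documentation means, here is a tiny sketch with two nested circles, where k-means splits the plane in half while spectral clustering recovers the rings (the make_circles parameters are illustrative assumptions):
from sklearn.datasets import make_circles
from sklearn.cluster import KMeans, SpectralClustering

# Two concentric rings: no center/spread description captures them.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
km = KMeans(n_clusters=2, random_state=0).fit_predict(X)   # cuts the plane in half
sc = SpectralClustering(n_clusters=2, affinity='nearest_neighbors',
                        random_state=0).fit_predict(X)     # separates the rings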
Nevertheless, your own intuition is not unfounded, and it shows if you just try a k-means clustering instead of a spectral one; using your exact data, we get
from sklearn.cluster import KMeans
clustering = KMeans(n_clusters=3, random_state=42).fit(affinity_matrix)
clusters = clustering.labels_.copy()
clusters = clusters.astype(np.int32) + 1
clusters
# result:
array([2, 2, 1, 2, 1, 2, 2, 3, 2, 1], dtype=int32)
where indeed sample #8 stands out as an outlier in a cluster of its own (#3).
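You can also read this off the affinity matrix directly: sample #8's largest off-diagonal affinities are precisely with #10, #3 and #5, which is consistent with the spectral result grouping them together. A quick check:
import numpy as np

row = affinity_matrix[7].copy()        # sample #8 (0-indexed)
row[7] = -np.inf                       # mask the self-affinity
print(np.argsort(row)[::-1][:3] + 1)   # [10  3  5] - the most similar samples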
Nevertheless, the same intuition is not necessarily applicable or useful with other clustering algorithms, whose value arguably lies exactly in their ability to uncover regularities of a different kind in the data; they would not be that useful if they just replicated the results of existing algorithms like k-means, would they?
The scikit-learn vignette Comparing different clustering algorithms on toy datasets is useful for getting an idea of how different clustering algorithms behave on some toy 2D datasets; the summary comparison chart there is worth a look.

How can I resize time-series data in python

I'm trying to do deep learning on time-series data.
Each series has 12 features per time step, but the series are not all the same length:
some have shape [48, 12] and some [54, 12], and I'm trying to resize them all to [50, 12].
All I know so far is resize from skimage.transform, but I don't know whether it works well here.
Is there any other solution for doing this?
For example, one of the features in the data looks like below.
The shape is [55, 1], and I would like to reshape it to [50, 1].
import numpy as np

a = np.array([-5.529309, -4.6293, -3.068647, -4.897388, -4.39951, -4.753769, -3.729291,
              -4.973984, -5.060155, -4.686748, -4.696322, -3.939932, -3.470778, -6.209103,
              -5.586756, -4.466532, -3.193116, -5.337818, -5.596331, -4.006954, -3.499502,
              -3.413331, -6.304848, -4.322914, -4.246317, -5.759098, -5.893142, -6.381444,
              -4.52398, -4.198445, -5.634629, -6.276124, -5.17505, -4.322914, -4.198445,
              -4.600576, -4.39951, -4.945261, -5.759098, -4.677173, -3.623971, -5.692076,
              -6.563361, -5.462287, -4.868664, -5.941015, -6.400594, -5.692076, -4.591002,
              -6.027186, -5.960164, -6.256975, -5.414414, -5.730374, -6.726129])
If I use resize, the data changes as shown in the before-resize and after-resize plots (omitted here).
One option is TimeSeriesResampler from tslearn, which resizes a given time series to a fixed length you specify by resampling the data via (linear) interpolation.
https://tslearn.readthedocs.io/en/stable/gen_modules/preprocessing/tslearn.preprocessing.TimeSeriesResampler.html
Example:
from tslearn.preprocessing import TimeSeriesResampler
ts = np.arange(5)
new_ts = TimeSeriesResampler(sz=9).fit_transform(ts)
final_ts = np.squeeze(new_ts)
print(ts) # [0 1 2 3 4]
print(new_ts) # [[[0. ] [0.5] [1. ] [1.5] [2. ] [2.5] [3. ] [3.5] [4. ]]]
print(final_ts) # [0. 0.5 1. 1.5 2. 2.5 3. 3.5 4.]
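If you would rather avoid the extra dependency, plain numpy linear interpolation does essentially the same thing. A minimal sketch (resample_series is a hypothetical helper, applied per feature column):
import numpy as np

def resample_series(series, size):
    # Linearly interpolate a 1-D series onto `size` evenly spaced points.
    old_x = np.linspace(0, 1, num=len(series))
    new_x = np.linspace(0, 1, num=size)
    return np.interp(new_x, old_x, series)

resized = resample_series(a, 50)
print(resized.shape)  # (50,)
# For a [55, 12] array, apply it column by column:
# resized = np.column_stack([resample_series(col, 50) for col in data.T])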

confused by apply function of GradientBoostingClassifier

For the apply function, you can refer to here.
My confusion comes mostly from this sample, and I have added some print statements to the code snippet below to output more debug information:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

grd = GradientBoostingClassifier(n_estimators=n_estimator)
grd_enc = OneHotEncoder()
grd_lm = LogisticRegression()
grd.fit(X_train, y_train)
test_var = grd.apply(X_train)[:, :, 0]
print("test_var.shape", test_var.shape)
print("test_var", test_var)
grd_enc.fit(grd.apply(X_train)[:, :, 0])
grd_lm.fit(grd_enc.transform(grd.apply(X_train_lr)[:, :, 0]), y_train_lr)
The output is below; I'm confused about what the numbers like 6., 3. and 10. mean, and how they are related to the final classification result.
test_var.shape (20000, 10)
test_var [[ 6. 6. 6. ..., 10. 10. 10.]
[ 10. 10. 10. ..., 3. 3. 3.]
[ 6. 6. 6. ..., 11. 10. 10.]
...,
[ 6. 6. 6. ..., 10. 10. 10.]
[ 6. 6. 6. ..., 11. 10. 10.]
[ 6. 6. 6. ..., 11. 10. 10.]]
To understand gradient boosting, you first need to understand the individual trees. I will show a small example.
Here is the setup: a small GB model trained on the Iris dataset to predict whether a flower belongs to class 2.
# import the most common dataset
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
X, y = load_iris(return_X_y=True)
# there are 150 observations and 4 features
print(X.shape) # (150, 4)
# let's build a small model = 5 trees with depth no more than 2
model = GradientBoostingClassifier(n_estimators=5, max_depth=2, learning_rate=1.0)
model.fit(X, y==2) # predict 2nd class vs rest, for simplicity
# we can access individual trees
trees = model.estimators_.ravel()
print(len(trees)) # 5
# there are 150 observations, each is encoded by 5 trees, each tree has 1 output
applied = model.apply(X)
print(applied.shape) # (150, 5, 1)
print(applied[0].T) # [[2. 2. 2. 5. 2.]] - a single row of the apply() result
print(X[0]) # [5.1 3.5 1.4 0.2] - the observation corresponding to that row
print(trees[0].apply(X[[0]])) # [2] - 2 is the result of applying the 0th tree to the sample
print(trees[3].apply(X[[0]])) # [5] - 5 is the result of applying the 3rd tree to the sample
You can see that each number in the sequence [2. 2. 2. 5. 2.] produced by model.apply() corresponds to the output of a single tree. But what do these numbers mean?
We can easily analyse decision trees by visual examination. Here is a function to plot one:
# a function to draw a tree. You need pydotplus and graphviz installed
# sudo apt-get install graphviz
# pip install pydotplus
from io import StringIO  # sklearn.externals.six was removed in recent scikit-learn versions
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus
def plot_tree(clf):
    dot_data = StringIO()
    export_graphviz(clf, out_file=dot_data, node_ids=True,
                    filled=True, rounded=True,
                    special_characters=True)
    graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
    return Image(graph.create_png())
# now we can plot the first tree
plot_tree(trees[0])
You can see that each node has a number (from 0 to 6). If we push our single example into this tree, it will first go to node #1 (because the feature x3 has value 0.2 < 1.75), and then to node #2 (because the feature x2 has value 1.4 < 4.95).
In the same way, we can analyze tree 3, which produced the output 5:
plot_tree(trees[3])
Here our observation goes first to node #4 and then to node #5, because x1=3.5>2.25 and x2=1.4<4.85. Thus, it ends up with number 5.
It's that simple! Each number produced by apply() is the ordinal number of the node of the corresponding tree in which the sample ends up.
The relation of these numbers to the final classification result goes through the values of the leaves in the corresponding trees. In binary classification, the values of all the leaves the sample lands in simply add up: if the sum is positive, the 'positive' class wins, otherwise the 'negative' one does. In multiclass classification, the values add up per class, and the class with the largest total value wins.
In our case, the first tree (with its node #2) gives the value -1.454, the other trees also give some values, and their total sum is -4.84. It is negative; thus, our example does not belong to class 2.
values = [trees[i].tree_.value[int(leaf)][0,0] for i, leaf in enumerate(applied[0].ravel())]
print(values) # [-1.454, -1.05, -0.74, -1.016, -0.58] - the values of nodes [2,2,2,5,2] in the corresponding trees
print(sum(values)) # -4.84 - sum of these values is negative -> this is not class 2
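Tying this back to the original snippet: the leaf indices returned by apply() are categorical identifiers, not magnitudes, which is exactly why the sample one-hot encodes them before fitting the logistic regression. A minimal sketch on the Iris model above:
from sklearn.preprocessing import OneHotEncoder

# Each sample becomes an indicator vector with one 1 per tree,
# marking the leaf it landed in.
enc = OneHotEncoder()
leaf_features = enc.fit_transform(applied[:, :, 0])
print(leaf_features.shape)  # (150, total number of distinct leaves across the 5 trees)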

Proximity Matrix in sklearn.ensemble.RandomForestClassifier

I'm trying to perform clustering in Python using Random Forests. In the R implementation of Random Forests, there is a flag you can set to get the proximity matrix. I can't seem to find anything similar in the python scikit version of Random Forest. Does anyone know if there is an equivalent calculation for the python version?
We don't implement the proximity matrix in scikit-learn (yet).
However, this could be done by relying on the apply function provided in our implementation of decision trees. That is, for all pairs of samples in your dataset, iterate over the decision trees in the forest (through forest.estimators_) and count the number of times they fall in the same leaf, i.e., the number of times apply gives the same node id for both samples in the pair.
Hope this helps.
Based on Gilles Louppe's answer, I have written a function. I don't know if it is efficient, but it works. Best regards.
import numpy as np

def proximityMatrix(model, X, normalize=True):
    # leaf index of every sample in every tree: shape (n_samples, n_trees)
    terminals = model.apply(X)
    nTrees = terminals.shape[1]
    a = terminals[:, 0]
    proxMat = 1 * np.equal.outer(a, a)
    for i in range(1, nTrees):
        a = terminals[:, i]
        proxMat += 1 * np.equal.outer(a, a)
    if normalize:
        proxMat = proxMat / nTrees
    return proxMat
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
train = load_breast_cancer()
model = RandomForestClassifier(n_estimators=500, max_features=2, min_samples_leaf=40)
model.fit(train.data, train.target)
proximityMatrix(model, train.data, normalize=True)
## array([[ 1. , 0.414, 0.77 , ..., 0.146, 0.79 , 0.002],
## [ 0.414, 1. , 0.362, ..., 0.334, 0.296, 0.008],
## [ 0.77 , 0.362, 1. , ..., 0.218, 0.856, 0. ],
## ...,
## [ 0.146, 0.334, 0.218, ..., 1. , 0.21 , 0.028],
## [ 0.79 , 0.296, 0.856, ..., 0.21 , 1. , 0. ],
## [ 0.002, 0.008, 0. , ..., 0.028, 0. , 1. ]])
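For what it's worth, the same computation can be written as a single numpy broadcast. A sketch (it materializes an (n_samples, n_samples, n_trees) boolean array, trading memory for brevity):
import numpy as np

def proximity_matrix_vectorized(model, X):
    leaves = model.apply(X)  # (n_samples, n_trees) leaf ids
    # Compare every pair of samples tree by tree, then average over trees.
    return (leaves[:, None, :] == leaves[None, :, :]).mean(axis=-1)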
There is nothing currently implemented for this in Python. I took a first try at it here. It would be great if somebody were interested in adding these methods to scikit-learn.
