I'm using the LinearRegression model from scikit-learn to fit an explanatory model on a time series:
from sklearn import linear_model
import numpy as np
X = np.column_stack([np.random.random(100), np.random.random(100)])  # shape (100, 2): 100 samples, 2 features
y = np.array(np.random.random(100))
regressor = linear_model.LinearRegression()
regressor.fit(X, y)
y_hat = regressor.predict(X)
I want to cross-validate the prediction. As far as I know, I can't use the standard cross-validation utilities from sklearn (like KFold) because they split the data randomly, and I need the folds to be sequential. For example,
data_set = [1 2 3 4 5 6 7 8 9 10]
# first train set
train = [1]
# first test set
test = [2 3 4 5 6 7 8 9 10]
#fit, predict, evaluate
# train set
train = [1 2]
# test set
test = [3 4 5 6 7 8 9 10]
#fit, predict, evaluate
...
# train set
train = [1 2 3 4 5 6 7 8]
# test set
test = [9 10]
#fit, predict, evaluate
Is it possible to do this using sklearn?
You do not need scikit-learn for this kind of folding; plain slicing is sufficient, something like:
step = 1
for i in range(1, len(data_set), step):
    train = data_set[:i]
    test = data_set[i:]
    # fit, predict, evaluate...
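For completeness, here is a minimal sketch of the full expanding-window loop, fitting and scoring a LinearRegression on each split (the mean-squared-error metric and the starting index of 2 are my own choices, not from the original post):
import numpy as np
from sklearn import linear_model
from sklearn.metrics import mean_squared_error

X = np.column_stack([np.random.random(100), np.random.random(100)])
y = np.random.random(100)

scores = []
for i in range(2, len(y)):
    X_train, y_train = X[:i], y[:i]   # all observations up to (not including) i
    X_test, y_test = X[i:], y[i:]     # everything from i onward
    regressor = linear_model.LinearRegression()
    regressor.fit(X_train, y_train)
    scores.append(mean_squared_error(y_test, regressor.predict(X_test)))

print(np.mean(scores))  # average out-of-sample error across the sequential folds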
I use StratifiedKFold and a form of grid search for my Logistic Regression.
skf = StratifiedKFold(n_splits=6, shuffle=True, random_state=SEED)
I call this for loop for each combination of parameters:
for fold, (trn_idx, test_idx) in enumerate(skf.split(X, y)):
My question is, are trn_idx and test_idx the same for each fold every time I run the loop?
For example, if fold0 contains trn_idx = [1,2,5,7,8] and test_idx = [3,4,6], is fold0 going to contain the same trn_idx and test_idx the next 5 times I run the loop?
Yes, the stratified k-fold split is fixed as long as random_state=SEED is fixed. The shuffle only shuffles the samples (together with their targets) once before the k-fold split.
This means that each fold will always contain the same indices:
x = list(range(10))
y = [1]*5 + [2]*5
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
for fold, (trn_idx, test_idx) in enumerate(skf.split(x, y)):
    print(trn_idx, test_idx)
Output:
[1 2 4 5 7 9] [0 3 6 8]
[0 1 3 5 6 8 9] [2 4 7]
[0 2 3 4 6 7 8] [1 5 9]
No matter how many times I run this code.
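Conversely, a different random_state gives a different split, which is again stable across runs. A minimal sketch reusing the x and y from above (the seed 0 is an arbitrary choice of mine):
skf_other = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
for fold, (trn_idx, test_idx) in enumerate(skf_other.split(x, y)):
    print(trn_idx, test_idx)  # folds differ from random_state=42, but are identical on every run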
Dataset file : google drive link
I have a dataset of shape (27884 rows, 8933 columns).
Here's a little preview of the dataset:
user_iD  b1  b2  b3  b4  b5  b6  b7  b8  b9  b10  b11
1         1   7   2   3   8   0   4   0   6    0    5
2         7   8   1   2   4   6   5   9  10    3    0
3         0   0   0   0   1   5   2   3   4    0    6
4         1   7   2   3   8   0   5   0   6    0    4
5         0   4   7   0   6   1   5   3   0    0    2
6         1   0   2   3   0   5   4   0   0    6    7
Here the column user_iD represents students, and the columns b1-b11 represent book chapters. The value in each cell gives the order in which that student studied the chapter (first, second, third, and so on); a 0 means the student did not study that chapter at all.
This is just a small preview of a big dataset.
There are 27884 users in total and 8932 chapters, labeled b1 to b8932.
I'm applying PCA, and I'm getting this error:
ValueError: Found array with 0 feature(s) (shape=(22307, 0)) while a minimum of 1 is required.
As I stated, there are 27884 users and 8932 other columns.
What I have tried so far
df3 = pd.read_feather('Bundles.ftr')
X = df3['user_iD']
y = df3.loc[:, df3.columns != 'user_iD']
# Splitting the X and Y into the
# Training set and Testing set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# performing preprocessing part
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train= X_train.values.reshape(-1, 1)
X_test = X_test.values.reshape(-1, 1)
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
# Applying PCA function on training
# and testing set of X component
from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
explained_variance = pca.explained_variance_ratio_
How do I apply PCA to this use case?
Here's how to use PCA to pre-process your data:
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

df3 = pd.read_feather('Bundles.ftr')
# Use every column except user_iD as the feature matrix
X = df3.loc[:, df3.columns != 'user_iD']

# Splitting X into the training set and testing set
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

# Convert to plain NumPy arrays
X_train = X_train.values
X_test = X_test.values

# Applying PCA on the training and testing sets
print(X_train.shape)
print(X_test.shape)
pca = PCA(n_components=2)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
explained_variance = pca.explained_variance_ratio_
This is what the X_train variable looks like after preprocessing:
array([[-1846.8651992 , 437.17734222],
[-1847.05838019, 437.41158726],
[-1845.67443438, 436.28046074],
...,
[-1847.00651974, 437.20374889],
[ -780.18296423, 116.65908052],
[-1847.09404683, 437.30545959]])
However, I don't think that PCA is the right tool here. There are a few reasons for this:
I think kNN would be easier to interpret.
The way the inputs are encoded mixes ordinal and categorical information, which will keep your clustering algorithm from working as well as it could.
For example, if one user read a chapter first and another user didn't read that chapter at all, those are encoded as 1 and 0; here a higher number means the user is more interested.
In another case, if one user read a chapter seventh and another user read it eighth, those are encoded as 7 and 8; here a higher number means the user is less interested.
On top of that, this encoding says that the difference between reading something seventh or eighth is the same size as the difference between reading it and not reading it at all. To me, if someone didn't read it, that's a much bigger difference than a slight change in reading order.
So I would suggest having two sets of input features: did they read it at all, and if they did, where in their reading did the chapter fall.
The first set of features could be computed like this:
did_read = (X.values >= 1).astype(int)
These features are 1 if read and 0 otherwise.
The second set of features could be computed like this:
X_values = X.values
max_order = X_values.max(axis=1, initial=1).reshape(-1, 1)
order_normalized = X_values / max_order
These features are in the range [0, 1] based on whether it was toward the beginning or end of the chapters that they read.
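If you then want a single feature matrix for your model, one option (my own suggestion, not part of the original answer) is to concatenate the two feature sets column-wise:
import numpy as np

# binary "did read" indicators next to the normalized reading-order features
features = np.hstack([did_read, order_normalized])
print(features.shape)  # (n_users, 2 * n_chapters)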
I'm trying (badly) to use sklearn's LOO functionality, and what I would like to do is write each training split into a dataframe, with a column labelling the split index. So, using the example from the sklearn page, but slightly modified:
import numpy as np
from sklearn.model_selection import LeaveOneOut
x = np.array([1,2])
y = np.array([3,4])
coords = np.column_stack((x,y))
z = np.array([8, 12])
loo = LeaveOneOut()
loo.get_n_splits(coords)
print(loo)
LeaveOneOut()
for train_index, test_index in loo.split(coords):
print("TRAIN:", train_index, "TEST:", test_index)
XY_train, XY_test = coords[train_index], coords[test_index]
z_train, z_test = z[train_index], z[test_index]
print(XY_train, XY_test, z_train, z_test)
Which returns:
TRAIN: [1] TEST: [0]
[[2 4]] [[1 3]] [12] [8]
TRAIN: [0] TEST: [1]
[[1 3]] [[2 4]] [8] [12]
In my case I'd like to write each split value to a dataframe like this:
X Y Ztrain Ztest split
0 1 2 8 12 0
1 3 4 8 12 0
2 1 2 12 8 1
3 3 4 12 8 1
And so on.
The motivation for doing this is that I want to try a jackknifing interpolation of sparse point data. Ideally I want to run an interpolation/gridder on each of the LOO training sets and then stack them, but I am struggling to access each training set to then use in something like griddata.
Any help would be appreciated, for the problem here or the approach in general.
I don't quite follow the logic of your desired dataframe, but you can try something like the following:
import pandas as pd

df = []
for train_index, test_index in loo.split(coords):
    x = pd.DataFrame({'XY_train': coords[train_index][0],
                      'XY_test': coords[test_index][0],
                      'Ztrain': z[train_index][0],
                      'Ztest': z[test_index][0]})
    df.append(x)
df = pd.concat(df)
df
XY_train XY_test Ztrain Ztest
0 2 1 12 8
1 4 3 12 8
0 1 2 8 12
1 3 4 8 12
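If you also want the split column from your desired output, a minimal sketch (my own adaptation, reusing the coords, z and loo objects from above) is to record the split index inside the loop:
import pandas as pd

frames = []
for split, (train_index, test_index) in enumerate(loo.split(coords)):
    frames.append(pd.DataFrame({'XY_train': coords[train_index][0],
                                'XY_test': coords[test_index][0],
                                'Ztrain': z[train_index][0],
                                'Ztest': z[test_index][0],
                                'split': split}))  # label each LOO split
df = pd.concat(frames, ignore_index=True)
print(df)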
As we know, the Bernoulli Naive Bayes classifier uses binary predictors (features). What I don't understand is how BernoulliNB in scikit-learn gives results even when the predictors are not binary. The following example is taken verbatim from the documentation:
import numpy as np
rng = np.random.RandomState(1)
X = rng.randint(5, size=(6, 100))
Y = np.array([1, 2, 3, 4, 4, 5])
from sklearn.naive_bayes import BernoulliNB
clf = BernoulliNB()
clf.fit(X, Y)
print(clf.predict(X[2:3]))
Output:
array([3])
Here are the first few features of X, and they are obviously not binary:
3 4 0 1 3 0 0 1 4 4 1
1 0 2 4 4 0 4 1 4 1 0
2 4 4 0 3 3 0 3 1 0 2
2 2 3 1 4 0 0 3 2 4 1
0 4 0 3 2 4 3 2 4 2 4
3 3 3 3 0 2 3 1 3 2 3
How does BernoulliNB work here even though the predictors are not binary?
This is due to the binarize argument; from the docs:
binarize : float or None, default=0.0
Threshold for binarizing (mapping to booleans) of sample features. If None, input is presumed to already consist of binary vectors.
When called with its default value binarize=0.0, as is the case in your code (since you do not specify it explicitly), it converts every element of X greater than 0 to 1, so the transformed X that is actually fed to the BernoulliNB classifier does indeed consist of binary values.
The binarize argument works exactly the same way with the stand-alone preprocessing function of the same name; here is a simplified example, adapting your own:
from sklearn.preprocessing import binarize
import numpy as np
rng = np.random.RandomState(1)
X = rng.randint(5, size=(6, 1))
X
# result
array([[3],
[4],
[0],
[1],
[3],
[0]])
binarize(X) # here as well, default threshold=0.0
# result (binary values):
array([[1],
[1],
[0],
[1],
[1],
[0]])
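To convince yourself that this is what happens internally, here is a small sketch (my own check, not from the docs) showing that binarizing X up front and passing binarize=None gives the same predictions as the default classifier on the raw X:
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.preprocessing import binarize

rng = np.random.RandomState(1)
X = rng.randint(5, size=(6, 100))
Y = np.array([1, 2, 3, 4, 4, 5])

clf_default = BernoulliNB().fit(X, Y)                        # binarizes internally at 0.0
clf_manual = BernoulliNB(binarize=None).fit(binarize(X), Y)  # we binarize ourselves

print(np.array_equal(clf_default.predict(X),
                     clf_manual.predict(binarize(X))))  # expected: True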
I have a trained scikit-learn KMeans model.
When using the model's predict function, the model assigns a given data point to the nearest cluster (as expected).
What is the easiest method to instead have the model assign the data point to the SECOND nearest, or THIRD nearest cluster?
I cannot seem to find this anywhere. (I might be missing something essential.)
The KMeans estimator has a transform(X) method that returns the distance of each record to each cluster centroid, as an array of shape [n_observations, n_clusters].
With that, you can pick which cluster to assign the records to.
Example:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.preprocessing import scale
np.random.seed(42)
digits = load_digits()
data = scale(digits.data)
n_digits = len(np.unique(digits.target))
km = KMeans(init='k-means++', n_clusters=n_digits, n_init=10)
km.fit(data)
predicted = km.predict(data)
dist_centers = km.transform(data)
To validate the transform output, we can compare the result of predict to taking the minimum value of the centroid distances:
>>> np.allclose(km.predict(data), np.argmin(dist_centers, axis=1))
True
Finally, we can use np.argsort to sort each row of the distance array by distance: the first column of the result then holds the labels of the nearest clusters, the second column the labels of the second-nearest clusters, and so on.
>>> print(predicted)
[0 3 3 ... 3 7 7]
>>> print(np.argsort(dist_centers, axis=1))
[[0 7 4 ... 8 6 5]
[3 9 4 ... 6 0 5]
[3 9 4 ... 8 6 5]
...
[3 1 9 ... 8 6 5]
[7 0 9 ... 8 6 5]
[7 3 1 ... 9 6 5]]
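So, to actually assign each record to its second- or third-nearest cluster, take the corresponding column of the argsort result. A minimal sketch, continuing from the variables above:
ranked = np.argsort(dist_centers, axis=1)  # clusters ordered from nearest to farthest
second_nearest = ranked[:, 1]              # label of the second-nearest cluster
third_nearest = ranked[:, 2]               # label of the third-nearest cluster
print(second_nearest[:10])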