I have a number of sample matrices (m x n) in X_samples.
Each matrix has the same number of rows (the same m) but a different number of features (a different n).
Examples of matrices in X_samples are 1000x40, 1000x35, 1000x30, and 1000x25.
I have the following (much simplified) code example to illustrate my question:
Y_train =
Y_test =
clf = ExtraTreesClassifier(n_estimators=500, max_depth=None, max_features="auto",
                           min_samples_split=1, random_state=0)
for X_data in X_samples:
    X_train = X_data[0]
    X_test = X_data[1]
    clf.fit(X_train, Y_train)
    pred_res = clf.predict(X_test)
    .....
I create a classifier outside the loop with the parameter max_features="auto".
I perform a different classification inside the loop for each sample matrix, each with a different number of features. My question is whether the classifier will adjust the value of max_features based on the actual size of X_train (the actual number of features) every time the loop performs the fit operation. The parameter max_features with value "auto" should resolve to the square root of the number of features.
So, should I create the classifier outside the loop or inside the loop? And is there a way to read the actual value used for max_features?
Yes. max_features="auto" is resolved from the number of features of whatever X you pass to fit, each time fit is called, so creating the classifier once outside the loop is fine.
The fit function does not change the estimator's parameters: max_features stays "auto", and the actual value (the square root of n_features) is derived internally on every fit.
See the docs.
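As a minimal sketch on synthetic data (illustrative sizes only): in recent scikit-learn versions "auto" has been replaced by the equivalent "sqrt" for classifiers, and the resolved value can be read from the fitted trees' max_features_ attribute rather than from the ensemble's max_features parameter.

import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.RandomState(0)
Y_train = rng.randint(0, 2, size=1000)

# "sqrt" is what "auto" resolved to for classifiers
clf = ExtraTreesClassifier(n_estimators=10, max_features="sqrt", random_state=0)

for n_features in (40, 35, 30, 25):
    X_train = rng.rand(1000, n_features)
    clf.fit(X_train, Y_train)
    # the parameter itself is unchanged by fit...
    print(clf.max_features, end=" -> ")
    # ...but each fitted tree stores the integer it derived from this X_train
    print(clf.estimators_[0].max_features_)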
So I'm trying to find a simple approach (not Dijkstra's algorithm) to a shortest-path problem.
Without reproducing everything: I have 3 paths and 50 samples of them (i.e. shape (50, 3)), and I have identified the shortest path for each sample using the min function,
with x_train being:
newx_train = np.zeros((50,3))
newx_train[:,0] = p1_train
newx_train[:,1] = p2_train
newx_train[:,2] = p3_train
[x_train output omitted: just randomly generated numbers]
and subsequently y_train (since I'm generating it, I pass it through the min function):
newy_train = np.zeros((50, 3))
newy_train[np.arange(newx_train.shape[0]), newx_train.argmin(axis=1)] = 1
print(newy_train)
[newy_train output omitted: each row has a 1 in the column holding its minimum value]
So I get something like
[[1,0,0],
[0,1,0],
[1,0,0],
[0,0,1]]
Based on the generated x_train and y_train, I am trying to implement SVM and logistic regression to see how well they perform for multi-class prediction, and then I'll compute the classification matrix and accuracy.
My question is: how do I go about using multi-class for logistic regression? When I run a fit on x_train and y_train, Python understandably throws an error that y should be a 1-D array but got (50, 3) instead.
from sklearn.linear_model import LogisticRegression
LogReg = LogisticRegression(solver = 'lbfgs', multi_class = 'multinomial')
LogReg.fit(newx_train,newy_train[:,0])
ylog_pred = LogReg.predict(newx_test)
print(ylog_pred)
The above code naturally works for binary classification (assuming only 2 paths), since predicting '1' for one column (index 0) naturally means the other column is '0'. But this does not work for multi-class. Could anyone help with it?
I think you're just missing how to interpret the y.
LogisticRegression expects y not to be one-hot encoded but to contain the actual target labels, so you need something like
newy_train = np.argmax(newy_train, axis=1) # index of max across each row
Then you should be able to fit something with
LogReg.fit(newx_train,newy_train)
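Putting it together, a self-contained sketch on made-up data (random stand-ins for p1_train/p2_train/p3_train and a hypothetical newx_test) might look like this:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
newx_train = rng.rand(50, 3)   # stand-in for the three path columns
newx_test = rng.rand(10, 3)

# one-hot labels built exactly as in the question
newy_train = np.zeros((50, 3))
newy_train[np.arange(newx_train.shape[0]), newx_train.argmin(axis=1)] = 1

# collapse one-hot rows to class indices 0/1/2 (index of the shortest path)
y_labels = np.argmax(newy_train, axis=1)

LogReg = LogisticRegression(solver='lbfgs', multi_class='multinomial')
LogReg.fit(newx_train, y_labels)
print(LogReg.predict(newx_test))   # one predicted path index per test sample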
The task is binary classification via a neural network. The data lives in a dictionary whose keys are composite names of the entries and whose values hold the label (0 or 1) as the third element. The first and second elements of each value are the two parts of the composite name, which are used later to extract the corresponding features.
In both of the cases described below, the dictionary is transformed into two arrays so that a balanced undersampling of the majority class (which makes up 66% of the data) can be performed:
data_for_sampling = np.asarray([key for key in list(data.keys())])
labels_for_sampling = [element[2] for element in list(data.values())]
sampler = RandomUnderSampler(sampling_strategy = 'majority')
data_sampled, label_sampled = sampler.fit_resample(data_for_sampling.reshape(-1, 1), labels_for_sampling)
Then the resampled arrays of names and labels are used to create train and test sets, either via the KFold method:
kfolder = KFold(n_splits = 10, shuffle = True)
kfolder.get_n_splits(data_sampled)
for train_index, test_index in kfolder.split(data_sampled):
data_train, data_test = data_sampled[train_index], data_sampled[test_index]
Or the train_test_split method:
data_train, data_test, label_train, label_test = train_test_split(data_sampled, label_sampled, test_size = 0.1, shuffle = True)
Finally, the names from data_train and data_test are used to re-extract the relevant entries (by key) from the original dictionary, which is then used to gather the features of those entries.

As far as I can tell, a single split of the 10-fold sets should give a similar train-test data distribution as the 90-10 train_test_split, and that seems to be true during training, where both training sets result in ~0.82 accuracy after only one epoch, run separately with model.fit(). However, when the corresponding models are evaluated using model.evaluate() on the test sets after said epoch, the set from train_test_split gives a score of ~0.86, while the set from KFold gives ~0.72. I have run numerous tests to rule out a bad random seed (the seed is not fixed), but the results stayed the same. The sets also have correctly balanced label distributions and seemingly well-shuffled entries.
As it turns out, the problem originates from a combination of sources:
While shuffle = True in the train_test_split() method properly shuffles the provided data first and then splits it into the desired parts, shuffle = True in the KFold method only randomises which samples are assigned to which fold; the data within each fold remains in its original order.
This is pointed out in the documentation; I found it thanks to this post:
https://github.com/scikit-learn/scikit-learn/issues/16068
Now, during learning, my custom train function applies shuffle again to the train data, just to be sure, but it does not shuffle the test data. Moreover, model.evaluate() defaults to batch_size=32 if no parameter is given, which, paired with the ordered test data, resulted in the discrepancy in validation accuracy. The test data is indeed flawed in the sense that it contains a large portion of "hard-to-predict" entries, which were clustered together because of the ordering and seem to have dragged down the average accuracy. A completed run across all N folds, as pointed out by TC Arlen, may indeed give a more precise estimate in the end, but I expected closer results after only one fold, which led to the discovery of this problem.
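A small demonstration of this on toy data: with shuffle=True, KFold randomises which samples land in each fold, but split() returns each fold's indices in ascending order, so the fold itself stays ordered unless you shuffle it again (one possible fix is the explicit re-shuffle below).

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(-1, 1)
rng = np.random.RandomState(0)
kfolder = KFold(n_splits=5, shuffle=True, random_state=0)

for train_index, test_index in kfolder.split(X):
    print(test_index)        # indices come back in ascending order
    rng.shuffle(test_index)  # explicitly re-shuffle before building the test fold
    data_test = X[test_index]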
Depending on the amount of noise in the data and on the size of the dataset, it can be expected behavior for scores on out-of-sample data to deviate by this amount. One split is not guaranteed to be just like any other split, which is why you have 10 of them in the first place and then average across all the results.
What you should trust to be the most generalizable is not any one given split (whether it comes from one of the 10 folds or from train_test_split()); what is far more trustworthy is the average result across all N folds.
Digging deeper into the data could reveal whether there is some reason why one or more splits deviate so much from the others. For example, perhaps there is some feature in your data (e.g. "date the sample was collected", where the collection methodology changed from month to month) that makes the samples differ from one another in a biased way. If that is the case, you should use a stratified test split (in your CV as well; see the scikit-learn documentation) so you get a more unbiased grouping of your data.
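For reference, a sketch of the stratified variants on synthetic stand-ins for the resampled arrays (data_sampled, label_sampled) from the question:

import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

rng = np.random.RandomState(0)
data_sampled = np.arange(200).reshape(-1, 1)      # stand-in for the resampled keys
label_sampled = rng.randint(0, 2, size=200)       # stand-in for the 0/1 labels

# stratified hold-out split: keeps the label ratio equal in train and test
data_train, data_test, label_train, label_test = train_test_split(
    data_sampled, label_sampled, test_size=0.1, shuffle=True, stratify=label_sampled)

# stratified K-fold: split() takes the labels it should stratify on
skf = StratifiedKFold(n_splits=10, shuffle=True)
for train_index, test_index in skf.split(data_sampled, label_sampled):
    data_train, data_test = data_sampled[train_index], data_sampled[test_index]
    label_train, label_test = label_sampled[train_index], label_sampled[test_index]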
I'm using StandardScaler() and the lin_reg.coef_ attribute in the following context:
for i in range(100):
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=i)
    scaler = StandardScaler().fit(x_train)
    x_train = scaler.transform(x_train)
    x_test = scaler.transform(x_test)
    lin_reg = LinearRegression().fit(x_train, y_train)
    if i == 0:
        print(lin_reg.coef_)
    if i == 1:
        print(lin_reg.coef_)
This leads to the following output:
[screenshot of the printed coefficient arrays]
So, as expected, coef_ returns the coefficients for the 22 different features I am passing into the linear regression. However, in the second output, some of the coefficients are far too large (e.g. 1.61e+14). I am fairly sure that the scaling with StandardScaler() works as it should. However, if I do not scale the training data before reading coef_, I do not get these huge coefficients. One important thing I should mention is that the last 13 features are binary, whereas the first 9 features are continuous (such as age). I can imagine that the problem is somehow related to this, although for the first binary feature the coefficients are computed properly (only the last 12 binary features have excessively large coefficients).
You should use standardization when the data comes from a Gaussian distribution; using StandardScaler() on binary data doesn't make any sense.
You should scale only the first 9 variables and then pass all of them into the linear regression.
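For instance, a sketch with ColumnTransformer under the question's layout (first 9 columns continuous, last 13 binary; the data here is synthetic):

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
x = np.hstack([rng.normal(size=(100, 9)), rng.randint(0, 2, size=(100, 13))])
y = rng.normal(size=100)

preprocess = ColumnTransformer(
    transformers=[("scale", StandardScaler(), list(range(9)))],
    remainder="passthrough",          # binary columns are left untouched
)
model = make_pipeline(preprocess, LinearRegression())
model.fit(x, y)
print(model.named_steps["linearregression"].coef_)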
https://www.atoti.io/when-to-perform-a-feature-scaling/
Avoid scaling binary columns in scikit-learn StandardScaler
I'm trying to use a basic Naive Bayes classifier in Python (in VS Code). My attempts all yield 0.0 accuracy.
This is the sample data: a CSV without a header, in the format
class,"['item1','item2','etc']"
The goal is to fit this data to a Multinomial NB model. This is my attempt at it:
import pandas
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

df = pandas.read_csv('file.csv', delimiter=',', names=['class', 'words'], encoding='utf-8')

# X is the independent variable/feature
X = df.drop('class', axis=1)
# Y is the dependent variable/label
Y = df['class']

# split data into train/test splits, use 25% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state=42)

# create a sparse matrix of words; each word is assigned a number and its frequency is
# counted (i.e. word "x" occurs n times in class Z); rows are classes, columns are words
cv = CountVectorizer()
X_tr = cv.fit_transform(X_train.words)
X_te = cv.transform(X_test.words)

model = MultinomialNB()
model.fit(X_tr, y_train)

y_pred = model.predict(X_te)
print(metrics.accuracy_score(y_test, y_pred))
# accuracy = accuracy_score(y_test, y_pred) * 100
# print(accuracy)
As I understand it, the following occurs:
A dataframe, df, is created and split into X and Y (words and classes).
The data is split into training and testing groups.
The CountVectorizer, cv, assigns an index to each word and counts how many times each word occurs in a certain class (word occurrences as numbers).
A multinomial NB model is created and fit with the training data (X_train.words is used so that the "words" column label is ignored).
The model is tested with the testing data and an accuracy score is printed.
I've already tried:
Checking the shapes of the X_train and X_test matrices: they match as I think they should, with an equal number of columns (words) and a 6:3 ratio of rows (classes, per the train/test split).
Checking the variable types: the training and testing X's are all sparse matrices (<class 'scipy.sparse.csr.csr_matrix'>), and the training/testing y's are, per the parameters of model.fit, array-like shapes of n samples (pandas Series).
The issue is that the accuracy is 0.0, meaning something is wrong. Perhaps the greater issue is that I have no idea what.
The problem is that your whole data frame is just 9 rows long, so your model doesn't learn anything. Also, I checked your dataset, and I don't think you can build a sentence classifier from it, as there are no sentences in it.
For various reasons, I have base dataframes with the following structure:
print(df1.shape)
display(df1.head())
print(df2.shape)
display(df2.head())
Where the top dataframe is my feature set and the bottom one is my output set. To turn this into a problem that is amenable to data modeling, I first do:
x_train, x_test, y_train, y_test = train_test_split(df1, df2, train_size = 0.8)
I then have a split for 80% training and 20% testing.
Since the output set (df2, i.e. y_train/y_test) consists of individual measurements with no inherent meaning on their own, I calculate pairwise distances between the labels to generate a single output value denoting the pairwise distance between observations (the distances are computed after z-scoring; the z-scoring code isn't shown here, but it is done):
y_train = pdist(y_train, 'euclidean')
y_test = pdist(y_test, 'euclidean')
Similarly, I then apply this strategy to the feature set to generate pairwise distances between the individual observations of each feature:
def feature_distances(input_vector):
    modified_vector = np.array(input_vector).reshape(-1, 1)
    vector_distances = pdist(modified_vector, 'euclidean')
    vector_distances = pd.Series(vector_distances)
    return vector_distances

x_train = x_train.apply(feature_distances, axis=0)
x_test = x_test.apply(feature_distances, axis=0)
I then proceed to train and test all of my models.
For now I am trying linear regression, random forest, and XGBoost.
Is there an easy way to implement a cross-validation scheme for my dataset?
Since my problem requires calculating pairwise distances between observations, I am struggling to find an easy way to set up cross-validation for parameter tuning.
GridSearchCV doesn't quite work here, since in each instance of the train/test split the distances have to be recomputed to avoid contaminating the test data with the train data.
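To make this concrete, here is a rough sketch of the kind of manual loop I have in mind (toy stand-ins for df1/df2, the same feature_distances helper as above, and the z-scoring step omitted), where the distances are recomputed inside every fold:

import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold

rng = np.random.RandomState(0)
df1 = pd.DataFrame(rng.rand(42, 5))      # stand-in feature set
df2 = pd.DataFrame(rng.rand(42, 4))      # stand-in output set

def feature_distances(input_vector):
    modified_vector = np.array(input_vector).reshape(-1, 1)
    return pd.Series(pdist(modified_vector, 'euclidean'))

for train_index, test_index in KFold(n_splits=5, shuffle=True).split(df1):
    x_tr, x_te = df1.iloc[train_index], df1.iloc[test_index]
    y_tr, y_te = df2.iloc[train_index], df2.iloc[test_index]
    # pairwise distances computed separately per fold, so test never touches train
    x_tr_d, x_te_d = x_tr.apply(feature_distances), x_te.apply(feature_distances)
    y_tr_d, y_te_d = pdist(y_tr, 'euclidean'), pdist(y_te, 'euclidean')
    model = LinearRegression().fit(x_tr_d, y_tr_d)
    print(r2_score(y_te_d, model.predict(x_te_d)))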
Hope it's clear!
First, what I understand from the shapes of your data frames is that you have 42 samples and 1643 features in the input, and each output vector consists of 392 values.
Huge input: if you are sure that your problem really has 1643 features, you might want to use PCA to reduce the dimensionality instead of pairwise distances. You should also collect more than 42 samples to avoid overfitting, because that is not enough data to train and test your model.
Huge output: you could use sampled_softmax_loss to speed up the training process, as mentioned in the TensorFlow documentation. You could also read more about it here. If you do not want to follow this approach, you can keep training with this output, but it will take some time.
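A minimal sketch of the PCA suggestion on a synthetic stand-in for the 42x1643 feature matrix (standardise first, then keep the components that explain ~95% of the variance):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
x_train = rng.rand(42, 1643)      # stand-in for the real feature matrix

reducer = make_pipeline(StandardScaler(), PCA(n_components=0.95))
x_train_reduced = reducer.fit_transform(x_train)
print(x_train_reduced.shape)      # far fewer columns than 1643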
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=n)
Here X is the independent feature matrix and y is the dependent feature, i.e. what you actually want to predict; it could be a label or a continuous value. We apply train_test_split to the dataset and then use (x_train, y_train) to train the model and (x_test, y_test) to test it, to check the model's performance on unseen data. In your case you have given y as df2, which is wrong: figure out your target feature and pass that as y, and there is no need to split the test data separately.
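A purely hypothetical illustration of that last point ('target' is a made-up column name standing in for whatever single quantity you want to predict): df1 supplies the features and y is that one column, so df2 is not split as a whole.

import pandas as pd
from sklearn.model_selection import train_test_split

# toy data; 'target' is hypothetical and not from the original question
df1 = pd.DataFrame({'feat_a': [1, 2, 3, 4, 5], 'feat_b': [5, 4, 3, 2, 1]})
df2 = pd.DataFrame({'target': [0, 1, 0, 1, 0]})

y = df2['target']
x_train, x_test, y_train, y_test = train_test_split(df1, y, test_size=0.2, random_state=0)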