sklearn SelectKBest: which variables were chosen?

I'm trying to get sklearn to select the best k variables (for example k=1) for a linear regression. This works and I can get the R-squared, but it doesn't tell me which variables were the best. How can I find that out?
I have code of the following form (real variable list is much longer):
X = []
for i in range(len(df)):
    X.append([averageindegree[i], indeg3_sum[i], indeg5_sum[i], indeg10_sum[i]])
training = []
actual = []
counter = 0
for fold in range(500):
    X_train, X_test, y_train, y_test = crossval.train_test_split(X, y, test_size=0.3)
    clf = LinearRegression()
    #clf = RidgeCV()
    #clf = LogisticRegression()
    #clf = ElasticNetCV()
    b = fs.SelectKBest(fs.f_regression, k=1)  # k is number of features.
    b.fit(X_train, y_train)
    #print(b.get_params())
    X_train = X_train[:, b.get_support()]
    X_test = X_test[:, b.get_support()]
    clf.fit(X_train, y_train)
    sc = clf.score(X_train, y_train)
    training.append(sc)
    #print("The training R-squared for fold " + str(fold) + " is " + str(round(sc*100, 1)) + "%")
    sc = clf.score(X_test, y_test)
    actual.append(sc)
    #print("The actual R-squared for fold " + str(fold) + " is " + str(round(sc*100, 1)) + "%")

You need to use get_support (after fitting the selector):
features_columns = [.......]
fs = SelectKBest(score_func=f_regression, k=5)
fs.fit(X_train, y_train)  # get_support() is only available on a fitted selector
print(list(zip(fs.get_support(), features_columns)))

Try using b.fit_transform() instead of b.transform(). The fit_transform() method fits the selector and transforms your input X into a new X containing only the selected features, returning the new X.
...
b = fs.SelectKBest(fs.f_regression, k=1) #k is number of features.
X_train = b.fit_transform(X_train, y_train)
#print(b.get_params())
...

The way to do it is to configure SelectKBest with your favourite score function (f_regression in your case), fit it, and then read the support mask off the fitted selector.
My code assumes you have a list features_list that contains the column names of X.
kb = SelectKBest(score_func=f_regression, k=5)  # configure SelectKBest
kb.fit(X, Y)  # fit it to your data
# get_support() gives a boolean mask like [False, False, True, False, ...]
print(np.array(features_list)[kb.get_support()])  # a plain list can't take a boolean mask
You can certainly write it more pythonically than me :-)
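For reference, here is a minimal, self-contained sketch tying the pieces together; the toy DataFrame, target, and column values are invented for illustration (only the column names echo the question):

import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Toy data: 100 samples, 4 hypothetical predictor columns
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.randn(100, 4),
                  columns=["averageindegree", "indeg3_sum", "indeg5_sum", "indeg10_sum"])
y = 2.0 * df["indeg3_sum"] + 0.1 * rng.randn(100)  # target driven by one column

kb = SelectKBest(score_func=f_regression, k=1)
kb.fit(df.values, y)

# get_support() returns a boolean mask over the columns; index the column names with it
selected = df.columns[kb.get_support()]
print(selected.tolist())  # e.g. ['indeg3_sum']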

Related

Logistic regression - how to fit a model with multiple features and show coefficients

I fit a logistic regression with 1 or 2 features:
X = df[["decile_score", "age"]]
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.20, random_state=100
)
logistic_age_model = linear_model.LogisticRegression()
logistic_age_model.fit(X_train, y_train)
beta_0 = logistic_age_model.intercept_[0]
beta_1, beta_2 = logistic_age_model.coef_[0]
print(f"Fit model: p(recid) = L({beta_0:.4f} + {beta_1:.4f} decile_score + {beta_2:.4f} age)")
If I have more than 2 features (15, for example), how can I write out the fitted model to see each coefficient?
For example, given
Fit model: p(recid) = L(-0.8480 + 0.2475 decile_score + -0.0135 age)
I want to see how each of the 15 features affects the result.
Do I need to declare a beta for each coefficient, and if so, how can I do that?
I think you are looking for a more efficient way to print the Logistic formula for many variables.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Initialize model
logistic_age_model = LogisticRegression()
# Fit model (X now has 15 features)
logistic_age_model.fit(X, y)
# List of coefficient values: intercept first, then one coefficient per feature.
# (np.concatenate keeps the original order; np.union1d would sort and
# de-duplicate the values, scrambling the mapping from betas to features.)
coefs = np.concatenate([logistic_age_model.intercept_, logistic_age_model.coef_[0]]).tolist()
# List of names
betas = ['beta_' + str(i) for i in range(len(coefs))]
# Combine `coefs` & `betas` to form a dictionary
d = dict(zip(betas, coefs))
# Print as formula
print('L(' + str(d) + ')')
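If you want the printout to look like the two-feature example (with real feature names rather than generic betas), a self-contained sketch along these lines should work; the data, feature names, and target here are invented for illustration:

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical data: 15 named features
rng = np.random.RandomState(0)
feature_names = [f"feature_{i}" for i in range(15)]
X = pd.DataFrame(rng.randn(200, 15), columns=feature_names)
y = (X["feature_0"] + rng.randn(200) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# Build the formula term by term: intercept first, then one signed term per feature
terms = [f"{model.intercept_[0]:.4f}"]
for name, coef in zip(X.columns, model.coef_[0]):
    terms.append(f"{coef:+.4f} {name}")  # '+' forces an explicit sign on each term
print("Fit model: p(y) = L(" + " ".join(terms) + ")")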

See the score of each fold when cross validating a model using a for loop

I want to see the individual score of each fitted model to visualize the strength of cross validation (I am doing this to show my coworkers why cross validation is important).
I have a .csv file with 500 rows, 200 independent variables and 1 binary target. I defined skf to fold the data 5 times using StratifiedKFold.
My code looks like this:
X = data.iloc[0:500, 2:202]
y = data["target"]
skf = StratifiedKFold(n_splits = 5, random_state = 0)
clf = svm.SVC(kernel = "linear")
Scores = [0] * 5
for i, j in skf.split(X, y):
    X_train, y_train = X.iloc[i], y.iloc[i]
    X_test, y_test = X.iloc[j], y.iloc[j]
    clf.fit(X_train, y_train)
    clf.score(X_test, y_test)
As you can see, I assigned a list of 5 zeroes to Scores. I would like to assign the clf.score(X_test, y_test) of each of the 5 predictions to the list. However, the indices i and j are not {1, 2, 3, 4, 5}. Rather, they are arrays of row indices used to split the X and y data frames.
How can I assign the test scores of each of the k fitted models to Scores within this loop? Do I need a separate index for this?
I know that using cross_val_score literally does all this and returns the array of k scores (which you can then average). However, I want to show my coworkers what happens behind the cross-validation functions that come with the sklearn library.
Thanks in advance!
If I understood the question, and you don't need any particular indexing for Scores:
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

X = np.random.normal(size=(500, 200))
y = np.random.randint(low=0, high=2, size=500)
skf = StratifiedKFold(n_splits=5)  # random_state is only used when shuffle=True
clf = SVC(kernel="linear")
Scores = []
for i, j in skf.split(X, y):
    X_train, y_train = X[i], y[i]
    X_test, y_test = X[j], y[j]
    clf.fit(X_train, y_train)
    Scores.append(clf.score(X_test, y_test))
The result is:
>>> Scores
[0.5247524752475248, 0.53, 0.5, 0.51, 0.4444444444444444]
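To underline the point for your coworkers, a quick follow-up sketch (reusing the X, y, skf, and clf from the snippet above) shows that cross_val_score performs exactly this loop internally, so the per-fold scores should match:

from sklearn.model_selection import cross_val_score

# Same splitter and estimator as the manual loop: cross_val_score returns
# one score per fold, matching what the loop collected in `Scores`
cv_scores = cross_val_score(clf, X, y, cv=skf)
print(cv_scores)          # per-fold scores
print(cv_scores.mean())   # the single averaged number people usually report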

Grading System - Input Features

I am working on a Grading System ( graduation project ). I have preprocessed the data, then used TfidfVectorizer on the data and used LinearSVC to fit the model.
The system has 265 definitions of arbitrary lengths; in total, vectorization yields a matrix of shape (265, 8581),
so when I try to input some new random sentence to predict against it, I get the error shown below.
[screenshot of the error message]
You can have a look at the full (and long) code used if you want to.
Code used:
def normalize(df):
    lst = []
    for x in range(len(df)):
        text = re.sub(r"[,.'!?]", '', df[x])
        lst.append(text)
    filtered_sentence = ' '.join(lst)
    return filtered_sentence

def stopWordRemove(df):
    stop = stopwords.words("english")
    needed_words = []
    for x in range(len(df)):
        words = word_tokenize(df)
        for word in words:
            if word not in stop:
                needed_words.append(word)
    return needed_words

def prepareDataSets(df):
    sentences = []
    for index, d in df.iterrows():
        Definitions = stopWordRemove(d['Definitions'].lower())
        Definitions_normalized = normalize(Definitions)
        if d['Results'] == 'F':
            sentences.append([Definitions, 'false'])
        else:
            sentences.append([Definitions, 'true'])
    df_sentences = DataFrame(sentences, columns=['Definitions', 'Results'])
    for x in range(len(df_sentences)):
        df_sentences['Definitions'][x] = ' '.join(df_sentences['Definitions'][x])
    return df_sentences

def featureExtraction(data):
    vectorizer = TfidfVectorizer(min_df=10, max_df=0.75, ngram_range=(1, 3))
    tfidf_data = vectorizer.fit_transform(data)
    return tfidf_data

def learning(clf, X, Y):
    X_train, X_test, Y_train, Y_test = \
        cross_validation.train_test_split(X, Y, test_size=.2, random_state=43)
    classifier = clf()
    classifier.fit(X_train, Y_train)
    predict = cross_validation.cross_val_predict(classifier, X_test, Y_test, cv=5)
    scores = cross_validation.cross_val_score(classifier, X_test, Y_test, cv=5)
    print(scores)
    print("Accuracy of %s: %0.2f (+/- %0.2f)" % (classifier, scores.mean(), scores.std() * 2))
    print(classification_report(Y_test, predict))
Then I run this script, which produces the error mentioned above:
test = LinearSVC()
data, target = preprocessed_df['Definitions'], preprocessed_df['Results']
tfidf_data = featureExtraction(data)
X_train, X_test, Y_train, Y_test = \
    cross_validation.train_test_split(tfidf_data, target, test_size=.2, random_state=43)
test.fit(tfidf_data, target)
predict = cross_validation.cross_val_predict(test, X_test, Y_test, cv=10)
scores = cross_validation.cross_val_score(test, X_test, Y_test, cv=10)
print(scores)
print ("Accuracy of %s: %0.2f(+/- %0.2f)" % (test, scores.mean(), scores.std() *2))
print (classification_report(Y_test, predict))
Xnew = ["machine learning is playing games in home"]
tvect = TfidfVectorizer(min_df=1, max_df=1.0, ngram_range=(1,3))
X_test= tvect.fit_transform(Xnew)
ynew = test.predict(X_test)
On the test data you should never call fit_transform(), only transform(), and you should use the same vectorizer that was fitted on the training data.
Do this:
def featureExtraction(data):
    vectorizer = TfidfVectorizer(min_df=10, max_df=0.75, ngram_range=(1, 3))
    tfidf_data = vectorizer.fit_transform(data)
    # Here I am returning the vectorizer as well, which was used to generate the training data
    return vectorizer, tfidf_data
...
...
tfidf_vectorizer, tfidf_data = featureExtraction(data)
...
...
# Now using the same vectorizer on the test data
X_test = tfidf_vectorizer.transform(Xnew)
...
In your code, you are using a new TfidfVectorizer, which obviously knows nothing about the training data, and in particular does not know that the training data has 8581 features.
The test data should always be prepared in the same way as the training data. Otherwise, even if you don't get an error, the results are wrong and the model will not perform like that in real-world scenarios.
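As a side note, one way to make this class of bug structurally impossible is to bundle the vectorizer and classifier into a single estimator. Here is a minimal sketch using sklearn's Pipeline, reusing the parameters and the raw data/target from the question (this is not the original author's code):

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Fitting the pipeline fits the vectorizer and classifier together on raw text;
# predict() then reuses the fitted vocabulary on any new sentences automatically.
model = Pipeline([
    ("tfidf", TfidfVectorizer(min_df=10, max_df=0.75, ngram_range=(1, 3))),
    ("svc", LinearSVC()),
])
model.fit(data, target)  # raw training sentences and their labels
ynew = model.predict(["machine learning is playing games in home"])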
See my other answers explaining similar situation for different feature preprocessing techniques:
https://stackoverflow.com/a/47205199/3374996
https://stackoverflow.com/a/50461140/3374996
https://stackoverflow.com/a/44671967/3374996
I would have tagged this question as a duplicate of one of these, but seeing that you are using a new vectorizer altogether and have a different method for transforming the train data, I answered it. Next time, please search for the issue first and try to understand what's happening in similar scenarios before posting a question.

TypeError: 'KFold' object is not iterable

I'm following one of the kernels on Kaggle, mainly, I'm following A kernel for Credit Card Fraud Detection.
I reached the step where I need to perform KFold in order to find the best parameters for Logistic Regression.
The following code is shown in the kernel itself, but for some reason (probably an older version of scikit-learn) it gives me some errors.
def printing_Kfold_scores(x_train_data, y_train_data):
    fold = KFold(len(y_train_data), 5, shuffle=False)
    # Different C parameters
    c_param_range = [0.01, 0.1, 1, 10, 100]
    results_table = pd.DataFrame(index=range(len(c_param_range), 2), columns=['C_parameter', 'Mean recall score'])
    results_table['C_parameter'] = c_param_range
    # the k-fold will give 2 lists: train_indices = indices[0], test_indices = indices[1]
    j = 0
    for c_param in c_param_range:
        print('-------------------------------------------')
        print('C parameter: ', c_param)
        print('-------------------------------------------')
        print('')
        recall_accs = []
        for iteration, indices in enumerate(fold, start=1):
            # Call the logistic regression model with a certain C parameter
            lr = LogisticRegression(C=c_param, penalty='l1')
            # Use the training data to fit the model. In this case, we use the portion of the fold to train the model
            # with indices[0]. We then predict on the portion assigned as the 'test cross validation' with indices[1]
            lr.fit(x_train_data.iloc[indices[0], :], y_train_data.iloc[indices[0], :].values.ravel())
            # Predict values using the test indices in the training data
            y_pred_undersample = lr.predict(x_train_data.iloc[indices[1], :].values)
            # Calculate the recall score and append it to a list for recall scores representing the current c_parameter
            recall_acc = recall_score(y_train_data.iloc[indices[1], :].values, y_pred_undersample)
            recall_accs.append(recall_acc)
            print('Iteration ', iteration, ': recall score = ', recall_acc)
        # The mean value of those recall scores is the metric we want to save and get hold of.
        results_table.ix[j, 'Mean recall score'] = np.mean(recall_accs)
        j += 1
        print('')
        print('Mean recall score ', np.mean(recall_accs))
        print('')
    best_c = results_table.loc[results_table['Mean recall score'].idxmax()]['C_parameter']
    # Finally, we can check which C parameter is the best amongst the chosen.
    print('*********************************************************************************')
    print('Best model to choose from cross validation is with C parameter = ', best_c)
    print('*********************************************************************************')
    return best_c
The errors I'm getting are as follows:
for this line: fold = KFold(len(y_train_data),5,shuffle=False)
Error:
TypeError: __init__() got multiple values for argument 'shuffle'
If I remove shuffle=False from this line, I get the following error:
TypeError: shuffle must be True or False; got 5
If I remove the 5 and keep shuffle=False, I get the following error:
TypeError: 'KFold' object is not iterable
which is from this line: for iteration, indices in enumerate(fold,start=1):
If someone can help me with solving this issue and suggest how this can be done with the latest version of scikit-learn it will be very appreciated.
Thanks.
That depends on how you imported KFold.
If you did this:
from sklearn.cross_validation import KFold
then your code should work, because that (old) class takes three parameters: the length of the array, the number of splits, and shuffle.
But if you are doing this:
from sklearn.model_selection import KFold
then it will not work: you only pass the number of splits and shuffle, not the length of the array, and you also need to change the enumerate() loop accordingly.
By the way, model_selection is the new module and is the recommended one to use. Try it like this:
fold = KFold(5, shuffle=False)
for train_index, test_index in fold.split(X):
    # Call the logistic regression model with a certain C parameter
    lr = LogisticRegression(C=c_param, penalty='l1')
    # Use the training data to fit the model. In this case, we use the portion of the fold to train the model
    lr.fit(x_train_data.iloc[train_index, :], y_train_data.iloc[train_index, :].values.ravel())
    # Predict values using the test indices in the training data
    y_pred_undersample = lr.predict(x_train_data.iloc[test_index, :].values)
    # Calculate the recall score and append it to a list for recall scores representing the current c_parameter
    recall_acc = recall_score(y_train_data.iloc[test_index, :].values, y_pred_undersample)
    recall_accs.append(recall_acc)
KFold is a splitter, so you have to give it something to split.
Example code:
import numpy as np
from sklearn.model_selection import KFold

X = np.array([[1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3], [4, 4, 4, 4]])
y = np.array([1, 2, 3, 4])
# Now you create your KFold: you just pass the number of splits and whether to shuffle.
fold = KFold(2, shuffle=False)
# To iterate over the folds, just use split()
for train_index, test_index in fold.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # then fit the classifier
If you want an index for the train/test loop, just add enumerate (note the parentheses, which are needed to unpack the index pair):
for i, (train_index, test_index) in enumerate(fold.split(X)):
    print('Iteration:', i)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
I hope this works

numpy: How can I select specific indexes in an np array for k-fold cross validation?

I have a training data set in matrix form of dimensions 5000 x 3027 (the CIFAR-10 data set). Using numpy's array_split, I partitioned it into 5 different parts, and I want to select just one of the parts as the cross-validation fold. However, my problem comes when I use something like
XTrain[[indexes]], where indexes is an array like [0,1,2,3]: doing this gives me a 3D tensor of dimensions 4 x 1000 x 3027, not a matrix. How do I collapse the "4 x 1000" into 4000 rows, to get a matrix of 4000 x 3027?
for fold in range(len(X_train_folds)):
    indexes = np.delete(np.arange(len(X_train_folds)), fold)
    XTrain = X_train_folds[indexes]
    X_cv = X_train_folds[fold]
    yTrain = y_train_folds[indexes]
    y_cv = y_train_folds[fold]
    classifier.train(XTrain, yTrain)
    dists = classifier.compute_distances_no_loops(X_cv)
    y_test_pred = classifier.predict_labels(dists, k)
    num_correct = np.sum(y_test_pred == y_test)
    accuracy = float(num_correct / num_test)
    k_to_accuracy[k] = accuracy
Perhaps you can try this instead (I'm new to numpy, so if I am doing something inefficient or wrong, I would be happy to be corrected):
X_train_folds = np.array_split(X_train, num_folds)
y_train_folds = np.array_split(y_train, num_folds)
k_to_accuracies = {}
for k in k_choices:
    k_to_accuracies[k] = []
    for i in range(num_folds):
        training_data, test_data = np.concatenate(X_train_folds[:i] + X_train_folds[i+1:]), X_train_folds[i]
        training_labels, test_labels = np.concatenate(y_train_folds[:i] + y_train_folds[i+1:]), y_train_folds[i]
        classifier.train(training_data, training_labels)
        predicted_labels = classifier.predict(test_data, k)
        k_to_accuracies[k].append(np.sum(predicted_labels == test_labels) / len(test_labels))
I would suggest using the scikit-learn package. It already comes with plenty of common machine learning tools, such as a K-fold cross-validation generator:
>>> from sklearn.cross_validation import KFold
>>> X = # your data [samples x features]
>>> y = # gt labels
>>> kf = KFold(X.shape[0], n_folds=5)
And then, iterate through kf:
>>> for train_index, test_index in kf:
...     X_train, X_test = X[train_index], X[test_index]
...     y_train, y_test = y[train_index], y[test_index]
...     # do something
The above loop will be executed n_folds times, each time with different training and testing indexes.
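Note that this answer uses the old sklearn.cross_validation module, which has since been removed. In current scikit-learn versions the splitter lives in sklearn.model_selection and is iterated via its split() method. A minimal sketch of the modern equivalent, with placeholder data shaped like the question's:

import numpy as np
from sklearn.model_selection import KFold

X = np.random.randn(5000, 3027)          # placeholder data [samples x features]
y = np.random.randint(0, 10, size=5000)  # placeholder labels

kf = KFold(n_splits=5)
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]  # 4000 x 3027 / 1000 x 3027
    y_train, y_test = y[train_index], y[test_index]
    # do something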
