Grading System - Input Features - python

I am working on a grading system (a graduation project). I preprocessed the data, vectorized it with TfidfVectorizer, and fit a LinearSVC model.
The system has 265 definitions of arbitrary length; after vectorization the feature matrix has shape (265, 8581).
When I try to predict on a new, arbitrary sentence, I get this message:
Error Message
The full (long) code is below, if you want to look at it:
def normalize(df):
    lst = []
    for x in range(len(df)):
        text = re.sub(r"[,.'!?]",'', df[x])
        lst.append(text)
    filtered_sentence = ' '.join(lst)
    return filtered_sentence

def stopWordRemove(df):
    stop = stopwords.words("english")
    needed_words = []
    for x in range(len(df)):
        words = word_tokenize(df)
        for word in words:
            if word not in stop:
                needed_words.append(word)
    return needed_words

def prepareDataSets(df):
    sentences = []
    for index, d in df.iterrows():
        Definitions = stopWordRemove(d['Definitions'].lower())
        Definitions_normalized = normalize(Definitions)
        if d['Results'] == 'F':
            sentences.append([Definitions, 'false'])
        else:
            sentences.append([Definitions, 'true'])
    df_sentences = DataFrame(sentences, columns=['Definitions', 'Results'])
    for x in range(len(df_sentences)):
        df_sentences['Definitions'][x] = ' '.join(df_sentences['Definitions'][x])
    return df_sentences

def featureExtraction(data):
    vectorizer = TfidfVectorizer(min_df=10, max_df=0.75, ngram_range=(1,3))
    tfidf_data = vectorizer.fit_transform(data)
    return tfidf_data

def learning(clf, X, Y):
    X_train, X_test, Y_train, Y_test = \
        cross_validation.train_test_split(X,Y, test_size=.2,random_state=43)
    classifier = clf()
    classifier.fit(X_train, Y_train)
    predict = cross_validation.cross_val_predict(classifier, X_test, Y_test, cv=5)
    scores = cross_validation.cross_val_score(classifier, X_test, Y_test, cv=5)
    print(scores)
    print ("Accuracy of %s: %0.2f(+/- %0.2f)" % (classifier, scores.mean(), scores.std() *2))
    print (classification_report(Y_test, predict))
Then I run the following, which produces the error mentioned above:
test = LinearSVC()
data, target = preprocessed_df['Definitions'], preprocessed_df['Results']
tfidf_data = featureExtraction(data)
X_train, X_test, Y_train, Y_test = \
cross_validation.train_test_split(tfidf_data,target, test_size=.2,random_state=43)
test.fit(tfidf_data, target)
predict = cross_validation.cross_val_predict(test, X_test, Y_test, cv=10)
scores = cross_validation.cross_val_score(test, X_test, Y_test, cv=10)
print(scores)
print ("Accuracy of %s: %0.2f(+/- %0.2f)" % (test, scores.mean(), scores.std() *2))
print (classification_report(Y_test, predict))
Xnew = ["machine learning is playing games in home"]
tvect = TfidfVectorizer(min_df=1, max_df=1.0, ngram_range=(1,3))
X_test= tvect.fit_transform(Xnew)
ynew = test.predict(X_test)

Never call fit_transform() on test data. Call only transform(), using the same vectorizer that was fitted on the training data.
Do this:
def featureExtraction(data):
    vectorizer = TfidfVectorizer(min_df=10, max_df=0.75, ngram_range=(1,3))
    tfidf_data = vectorizer.fit_transform(data)
    # Here I am returning the vectorizer as well, which was used to generate the training data
    return vectorizer, tfidf_data
...
...
tfidf_vectorizer, tfidf_data = featureExtraction(data)
...
...
# Now using the same vectorizer on test data
X_test= tfidf_vectorizer.transform(Xnew)
...
In your code, you are using a new TfidfVectorizer, which knows nothing about the training data and in particular does not know that the training data has 8581 features.
Test data should always be prepared in exactly the same way as the training data. Otherwise, even if you do not get an error, the results will be wrong and the model will not perform that way in real-world scenarios.
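A minimal sketch of that rule, assuming a toy corpus in place of the 265 definitions (the sentences and labels are made up for illustration, and min_df/max_df are left at their defaults so that such a tiny corpus keeps some features):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical stand-in for the real definitions and their grades.
train_texts = [
    "machine learning is the study of algorithms that learn from data",
    "machine learning lets computers improve from experience",
    "machine learning is playing games at home",
    "machine learning is a kind of washing machine",
]
train_labels = ["true", "true", "false", "false"]

# Fit the vectorizer once, on the training text only.
vectorizer = TfidfVectorizer(ngram_range=(1, 3))
X_train = vectorizer.fit_transform(train_texts)

clf = LinearSVC()
clf.fit(X_train, train_labels)

# Reuse the fitted vectorizer: transform() maps the new sentence onto the
# vocabulary learned above, so the column count matches what clf expects.
X_new = vectorizer.transform(["machine learning is playing games in home"])
print(X_train.shape, X_new.shape)
print(clf.predict(X_new))
```

Words in the new sentence that never appeared in the training text are simply ignored by transform(), which is why the feature count stays fixed.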
See my other answers explaining similar situations for different feature preprocessing techniques:
https://stackoverflow.com/a/47205199/3374996
https://stackoverflow.com/a/50461140/3374996
https://stackoverflow.com/a/44671967/3374996
I would have marked this question as a duplicate of one of these, but since you are using an entirely new vectorizer and a different method for transforming the training data, I answered it. Next time, please search for the issue first and try to understand what is happening in similar scenarios before posting a question.

Related

Get the error "All intermediate steps should be transformers and implement fit and transform or be the string passthrough" and I can't resolve it

I'm trying to create a training and prediction pipeline that lets me train models on training sets of various sizes and run predictions on the test data. I wrote this function:
def train_predict(learner, sample_size, X_train, y_train, X_test, y_test):
    results = {}
    learner = Pipeline(steps=[('tree', DecisionTreeClassifier()),
                              ('logistic', LogisticRegression()),
                              ('naive', MultinomialNB())])
    learner.fit(X_train, y_train)
    predictions_test = learner.predict(X_test)
    predictions_train = learner.predict(X_train[:300])
    results['acc_train'] = accuracy_score(y_train[:300], predictions_train)
    results['acc_test'] = accuracy_score(y_test, predictions_test)
    results['f_train'] = fbeta_score(y_train[:300], predictions_train, beta=0.5)
    results['f_test'] = fbeta_score(y_test, predictions_test, beta=0.5)
    print("{} trained on {} samples.".format(learner.__class__.__name__, sample_size))
    return results
and then ran this code:
clf_A = DecisionTreeClassifier()
clf_B = LogisticRegression()
clf_C = MultinomialNB()
samples_100 = len(y_train)
samples_10 = int(len(y_train) * 0.1)
samples_1 = int(len(y_train) * 0.01)
results = {}
for clf in [clf_A, clf_B, clf_C]:
    clf_name = clf.__class__.__name__
    results[clf_name] = {}
    for i, samples in enumerate([samples_1, samples_10, samples_100]):
        results[clf_name][i] = \
            train_predict(clf, samples, X_train, y_train, X_test, y_test)
vs.evaluate(results, accuracy, fscore)
but I got this error:
TypeError: All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' 'DecisionTreeClassifier()' (type <class 'sklearn.tree._classes.DecisionTreeClassifier'>) doesn't
I tried a lot and did not find a solution, can you help me?
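The error message itself points at the likely cause: in a scikit-learn Pipeline, every step except the last must be a transformer (an object with fit and transform), and only the final step may be a classifier, so chaining three classifiers fails. Below is a hedged sketch of one way to restructure the function, assuming the intent was to evaluate each classifier separately on the first sample_size training rows. The synthetic data from make_classification, and shifting it to be non-negative for MultinomialNB, are assumptions made so the sketch runs on its own; vs.evaluate is left out because it is not defined in the question.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, fbeta_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier


def train_predict(learner, sample_size, X_train, y_train, X_test, y_test):
    """Fit the learner that was passed in on the first sample_size rows."""
    learner.fit(X_train[:sample_size], y_train[:sample_size])
    predictions_test = learner.predict(X_test)
    predictions_train = learner.predict(X_train[:300])
    results = {
        "acc_train": accuracy_score(y_train[:300], predictions_train),
        "acc_test": accuracy_score(y_test, predictions_test),
        "f_train": fbeta_score(y_train[:300], predictions_train, beta=0.5),
        "f_test": fbeta_score(y_test, predictions_test, beta=0.5),
    }
    print("{} trained on {} samples.".format(learner.__class__.__name__, sample_size))
    return results


# Synthetic data (assumption); shifted so every feature is non-negative,
# which MultinomialNB requires.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X = X - X.min()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

samples_100 = len(y_train)
samples_10 = int(len(y_train) * 0.1)
samples_1 = int(len(y_train) * 0.01)

results = {}
classifiers = [
    DecisionTreeClassifier(random_state=0),
    LogisticRegression(max_iter=1000),
    MultinomialNB(),
]
for clf in classifiers:
    clf_name = clf.__class__.__name__
    results[clf_name] = {}
    for i, samples in enumerate([samples_1, samples_10, samples_100]):
        results[clf_name][i] = train_predict(
            clf, samples, X_train, y_train, X_test, y_test)
```

If a Pipeline is wanted, it would hold preprocessing steps followed by a single classifier as the last step, with one such pipeline per classifier.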

Retrieving same output for different instances for XGBoost regression algorithm

I am using the XGBoost regression algorithm on the following data to make predictions. The problem is that the model predicts the same output for any input, and I'm not sure why.
data= pd.read_csv("depthwise_data.csv", delimiter=',', header=None, skiprows=1, names=['input_size','input_channels','conv_kernel','conv_strides','running_time'])
X = data[['input_size', 'input_channels','conv_kernel', 'conv_strides']]
Y = data[["running_time"]]
X_train, X_test, y_train, y_test = train_test_split(
    np.array(X), np.array(Y), test_size=0.2, random_state=42)
y_train_log = np.log(y_train)
y_test_log = np.log(y_test)
xgb_depth_conv = xgb.XGBRegressor(objective ='reg:squarederror',
                                  n_estimators = 1000,
                                  seed = 123,
                                  tree_method = 'hist',
                                  max_depth=10)
xgb_depth_conv.fit(X_train, y_train_log)
y_pred_train = xgb_depth_conv.predict(X_train)
#y_pred_test = xgb_depth_conv.predict(X_test)
X_data=[[8,576,3,2]] #instance
X_test=np.log(X_data)
y_pred_test=xgb_depth_conv.predict(X_test)
print(np.exp(y_pred_test))
MSE_test, MSE_train = mse(y_test_log,y_pred_test), mse(y_train_log, y_pred_train)
R_squared = r2_score(y_pred_test,y_test_log)
print("MSE-Train = {}".format(MSE_train))
print("MSE-Test = {}".format(MSE_test))
print("R-Squared: ", np.round(R_squared, 2))
Output for first instance
X_data=[[8,576,3,2]]
print(np.exp(y_pred_test))
[0.7050679]
Output for second instance
X_data=[[4,960,3,1]]
print(np.exp(y_pred_test))
[0.7050679]
Your problem stems from this line: X_test=np.log(X_data)
Why are you applying log on the test cases while you have not applied it on the training samples?
If you take away the np.log completely, even from the target (y), you get really good results. I tested it myself with the data you provided us with.
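One hedged way to see why both instances come out identical: tree models split on thresholds that lie between training feature values, so log-scaled inputs such as log(576) ≈ 6.36 fall below every threshold and land in the same leaf. The sketch below uses a synthetic grid with a made-up running-time formula, and scikit-learn's DecisionTreeRegressor as a stand-in for XGBRegressor; both are assumptions made to keep the example self-contained. It keeps the log on the target, where it is applied consistently, and leaves the features on their raw scale at prediction time.

```python
import itertools

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic grid of (input_size, input_channels, conv_kernel, conv_strides).
X_train = np.array(
    list(itertools.product([4, 8, 16], [96, 576, 960], [3, 5], [1, 2])),
    dtype=float,
)
# Made-up running time that grows with the amount of work per layer.
size, channels, kernel, strides = X_train.T
y_train = size**2 * channels * kernel**2 / strides**2 * 1e-6

# Log-transform the target only; the features stay on their raw scale.
model = DecisionTreeRegressor(random_state=0)
model.fit(X_train, np.log(y_train))

instances = np.array([[8, 576, 3, 2], [4, 960, 3, 1]], dtype=float)

# Inconsistent: log-scaled features sit below every split threshold,
# so both rows end up in the same leaf.
wrong = np.exp(model.predict(np.log(instances)))

# Consistent: same feature scale as in training, then undo the target log.
right = np.exp(model.predict(instances))

print("log-scaled inputs:", wrong)
print("raw inputs:       ", right)
```

Dropping the log everywhere, as suggested above, is the simpler option; the point of the sketch is only that whatever transform is used must match between fit and predict.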

ValueError: Input has n_features=10 while the model has been trained with n_features=4261

I am trying to use a trained BoW, tf-idf, and SVM model for prediction:
def bagOfWords(files_data):
    count_vector = sklearn.feature_extraction.text.CountVectorizer()
    return count_vector.fit_transform(files_data)
files = sklearn.datasets.load_files(dir_path)
word_counts = util.bagOfWords(files.data)
tf_transformer = sklearn.feature_extraction.text.TfidfTransformer(use_idf=True).fit(word_counts)
X = tf_transformer.transform(word_counts)
clf = sklearn.svm.LinearSVC()
X_train, X_test, y_train, y_test = sklearn.cross_validation.train_test_split(X, y, test_size=test_size)
I can run the following:
clf.fit(X_train, y_train)
y_predicted = clf.predict(X_test)
But the following raises the error:
clf.fit(X_train, y_train)
new_word_counts = util.bagOfWords(["a place to listen to music it s making its way to the us"])
ready_to_be_predicted = tf_transformer.transform(new_word_counts)
predicted = clf.predict(ready_to_be_predicted)
I think I am already using the previously fitted tf_transformer, and I don't know why I still get the error. Any help is greatly appreciated!
You're not preserving the CountVectorizer you originally fit the data with.
This bagOfWords call is fitting a separate CountVectorizer in its own scope.
new_word_counts = util.bagOfWords(["a place to listen to music it s making its way to the us"])
You want to use the one you fit on your training set.
You are also fitting your transformers on the entire X, including X_test. You want to exclude your test set from any fitting, including transformations.
Try something like this.
files = sklearn.datasets.load_files(dir_path)
# Split in train/test
# (sklearn.cross_validation was removed in scikit-learn 0.20; newer versions use sklearn.model_selection)
X_train, X_test, y_train, y_test = sklearn.cross_validation.train_test_split(files.data, files.target)
# Fit and transform with X_train
count_vector = sklearn.feature_extraction.text.CountVectorizer()
word_counts = count_vector.fit_transform(X_train)
tf_transformer = sklearn.feature_extraction.text.TfidfTransformer(use_idf=True)
X_train = tf_transformer.fit_transform(word_counts)
clf = sklearn.svm.LinearSVC()
clf.fit(X_train, y_train)
# Transform X_test with the already-fitted transformers, then predict
test_word_counts = count_vector.transform(X_test)
ready_to_be_predicted = tf_transformer.transform(test_word_counts)
y_predicted = clf.predict(ready_to_be_predicted)
# Test example
new_word_counts = count_vector.transform(["a place to listen to music it s making its way to the us"])
ready_to_be_predicted = tf_transformer.transform(new_word_counts)
predicted = clf.predict(ready_to_be_predicted)
Of course, it's much less complicated to combine these transformers into a Pipeline.
http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
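A minimal sketch of such a Pipeline, assuming a tiny made-up corpus in place of load_files (the texts, labels, and step names are hypothetical):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical training data standing in for files.data / files.target.
train_texts = [
    "a place to listen to music and discover new bands",
    "stream music and playlists on your phone",
    "book cheap flights and hotels for your next trip",
    "compare hotel prices and plan your travel",
]
train_labels = ["music", "music", "travel", "travel"]

# The transformers come first; the classifier is the final step.
text_clf = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer(use_idf=True)),
    ("svm", LinearSVC()),
])

# fit() fits each transformer on the training text and then the classifier.
text_clf.fit(train_texts, train_labels)

# predict() only calls transform() on the fitted steps, so new text is
# always mapped onto the training vocabulary.
new_texts = ["a place to listen to music it s making its way to the us"]
predicted = text_clf.predict(new_texts)
print(predicted)
```

Because the fitted vectorizer lives inside the pipeline object, there is no separate CountVectorizer to lose track of between training and prediction.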

How to use SVM in scikit learn to classify different test data for review spam detection

I am doing review spam detection using an SVM in scikit-learn. For this task I am using a gold-standard dataset of 400 truthful and 400 deceptive reviews. So far I have done a train/test split of this dataset and measured the accuracy.
Now I want to train my SVM classifier on this dataset and then classify newly downloaded test data that is different from the original dataset.
How can I do this? My code so far is:
def main():
    init();
    dir_path ='C:\spam\hotel-reviews'
    files = sklearn.datasets.load_files(dir_path)
    model = CountVectorizer()
    X_train = model.fit_transform(files.data)
    tf_transformer = sklearn.feature_extraction.text.TfidfTransformer(use_idf=True).fit(word_counts)
    X = tf_transformer.transform(word_counts)
    #print X
    print '\n\n'
    # create classifier
    clf = sklearn.svm.LinearSVC()
    # test the classifier
    test_classifier(X, files.target, clf, test_size=0.2, y_names=files.target_names, confusion=False)

def test_classifier(X, y, clf, test_size=0.3, y_names=None, confusion=False):
    #train-test split
    X_train, X_test, y_train, y_test = sklearn.cross_validation.train_test_split(X, y, test_size=test_size)
    clf.fit(X_train, y_train)
    y_predicted = clf.predict(X_test)
    print sklearn.metrics.classification_report(y_test, y_predicted, target_names=y_names)

if __name__ == '__main__':
    main()
Now I want to classify my own, different set of 500 reviews in a reviews.txt file using the classifier trained above. How can I do this?
To score your data, two steps are needed.
Either return clf and use a separate method for scoring, or score within the same method. This is the workflow:
def scoreData(clf):
    x_for_predict = loadScoringData("reviews.txt") # Signature only. assuming same data format without target variable
    y_predict = clf.predict(x_for_predict)
    plotResults(clf, y_predict)# just a signature.
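The workflow above leaves the loading step as a signature. A hedged sketch of what it might look like, assuming reviews.txt holds one review per line (the file layout, the helper names load_scoring_data and score_data, and the four training reviews are all assumptions for illustration). The new text has to go through the same fitted CountVectorizer and TfidfTransformer before it reaches the classifier:

```python
import os
import tempfile

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC


def load_scoring_data(path):
    """Read one review per line (assumed layout), skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def score_data(path, vectorizer, tf_transformer, clf):
    """Push new reviews through the already-fitted transformers, then predict."""
    reviews = load_scoring_data(path)
    counts = vectorizer.transform(reviews)  # transform only, never refit
    features = tf_transformer.transform(counts)
    return reviews, clf.predict(features)


# Hypothetical training data standing in for the hotel-review corpus.
train_texts = [
    "the room was clean and the staff were friendly",
    "great location and a comfortable bed",
    "best hotel ever amazing amazing amazing must book now",
    "my husband and i loved this luxury hotel experience so much",
]
train_labels = ["truthful", "truthful", "deceptive", "deceptive"]

vectorizer = CountVectorizer()
tf_transformer = TfidfTransformer(use_idf=True)
X_train = tf_transformer.fit_transform(vectorizer.fit_transform(train_texts))
clf = LinearSVC().fit(X_train, train_labels)

# Write a small stand-in for reviews.txt so the sketch runs end to end.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "reviews.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("friendly staff and a clean room\n\n"
                     "amazing luxury experience must book\n")
    reviews, predictions = score_data(path, vectorizer, tf_transformer, clf)

for review, label in zip(reviews, predictions):
    print(label, "<-", review)
```

In the question's code this means main() would need to keep the fitted CountVectorizer and tf_transformer around (or return them along with clf) rather than only the classifier.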

sklearn selectKbest: which variables were chosen?

I'm trying to get sklearn to select the best k variables (for example k=1) for a linear regression. This works and I can get the R-squared, but it doesn't tell me which variables were the best. How can I find that out?
I have code of the following form (real variable list is much longer):
X=[]
for i in range(len(df)):
    X.append([averageindegree[i],indeg3_sum[i],indeg5_sum[i],indeg10_sum[i]])
training=[]
actual=[]
counter=0
for fold in range(500):
    X_train, X_test, y_train, y_test = crossval.train_test_split(X, y, test_size=0.3)
    clf = LinearRegression()
    #clf = RidgeCV()
    #clf = LogisticRegression()
    #clf=ElasticNetCV()
    b = fs.SelectKBest(fs.f_regression, k=1) #k is number of features.
    b.fit(X_train, y_train)
    #print b.get_params
    X_train = X_train[:, b.get_support()]
    X_test = X_test[:, b.get_support()]
    clf.fit(X_train,y_train)
    sc = clf.score(X_train, y_train)
    training.append(sc)
    #print "The training R-Squared for fold " + str(1) + " is " + str(round(sc*100,1))+"%"
    sc = clf.score(X_test, y_test)
    actual.append(sc)
    #print "The actual R-Squared for fold " + str(1) + " is " + str(round(sc*100,1))+"%"
You need to use get_support after fitting the selector:
features_columns = [.......]
fs = SelectKBest(score_func=f_regression, k=5)
fs.fit(X, y) # get_support() is only available on a fitted selector
print zip(fs.get_support(),features_columns)
Try using b.fit_transform() instead of b.transform(). fit_transform() will fit the selector and transform your input X into a new X containing only the selected features, then return the new X.
...
b = fs.SelectKBest(fs.f_regression, k=1) #k is number of features.
X_train = b.fit_transform(X_train, y_train)
#print b.get_params
...
The way to do it is to configure SelectKBest with your favourite scoring function (f_regression in your case), fit it, and then read the selected columns out of it.
My code assumes you have a list features_list that contains the column names of X.
import numpy as np
kb = SelectKBest(score_func=f_regression, k=5) # configure SelectKBest
kb.fit(X, Y) # fit it to your data
# get_support gives a vector [False, False, True, False....]
# a plain list cannot be indexed with a boolean mask, so convert it to an array first
print(np.array(features_list)[kb.get_support()])
Certainly you can write it more pythonic than me :-)
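A self-contained sketch of the get_support approach, assuming synthetic data in which the target is driven almost entirely by the third column (the data is made up; the column names are borrowed from the question only as labels):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

feature_names = ["averageindegree", "indeg3_sum", "indeg5_sum", "indeg10_sum"]

# Synthetic data: y depends on the third column plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, len(feature_names)))
y = 3.0 * X[:, 2] + 0.1 * rng.normal(size=200)

selector = SelectKBest(score_func=f_regression, k=1)
selector.fit(X, y)

# get_support() returns a boolean mask with one entry per input column.
mask = selector.get_support()
selected = [name for name, keep in zip(feature_names, mask) if keep]
print(selected)

# scores_ holds the F-statistic per column, useful for ranking all features.
print(dict(zip(feature_names, selector.scores_.round(1))))
```

Inside a cross-validation loop like the one in the question, the selected names could be collected per fold to see how stable the choice is.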
