I am trying to build a sentiment analyser using the scikit-learn LinearSVC classifier. The problem is that the classifier classifies every sentence as positive. A second question: why does predict() return a list of classified labels, one per text? I thought it would return a single text/number, the classified label. Here is a sample cut from the code.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

vectorizer = TfidfVectorizer(input='content', decode_error='ignore')
vect_train_x = vectorizer.fit_transform(training_data)  # training_data is a list of sentences
scaler = StandardScaler(with_mean=False)  # with_mean must be False for sparse input
X_train = scaler.fit_transform(vect_train_x)  # fit the scaler on the training data and transform it
vect_test_x = vectorizer.transform(test)  # the sentence that needs to be classified
X_test = scaler.transform(vect_test_x)
clf = LinearSVC()
clf.fit(X_train, labels)
print(vect_test_x)
print(clf.predict(X_test))  # returns a list: ['Positive' 'Positive' 'Positive' 'Positive' 'Positive' 'Positive']
I would be very grateful if you could explain what exactly I am not understanding. I tried to read the documentation, but without examples I could not make sense of it. My training data consists of 100,000 positive and 100,000 negative sentences.
I came across this because I had the same issue. What solved my problem was to convert the input to a list first, then to an np.array, which is then passed to the predict() function.
import numpy as np

new_array = []
new_array.append(Input)  # Input is a string read from a file or from input()
X_test = np.array(new_array)
print(clf.predict(X_test))
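Two things are worth spelling out here. First, predict() always returns one label per row of its input, so even a single sentence comes back as a one-element array; take index 0 to get a single label. Second, any new text has to go through the same fitted vectorizer and scaler as the training data. A minimal sketch, reusing the vectorizer, scaler, and clf names from the question above (new_sentence is a hypothetical input):

new_sentence = "I really enjoyed this film"  # hypothetical input

# reuse the *fitted* vectorizer and scaler: transform only, never fit again
vect_new = vectorizer.transform([new_sentence])
X_new = scaler.transform(vect_new)

labels_out = clf.predict(X_new)  # array with one label per input row
print(labels_out[0])             # index 0 gives the single predicted label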
I'm trying to use a basic Naive Bayes classifier in Python using VSC. All my attempts yield 0.0 accuracy.
This is sample data: a CSV without a header, in the format
class,"['item1','item2','etc']"
The goal is to fit this data to a Multinomial NB model. This is my attempt at it:
import pandas
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

df = pandas.read_csv('file.csv', delimiter=',', names=['class', 'words'], encoding='utf-8')
# X is the independent variable/feature
X = df.drop('class', axis=1)
# Y is the dependent variable/label
Y = df['class']
# split the data into train/test sets, using 25% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state=42)
# create a sparse matrix of word counts: each word is assigned an index and its
# frequency is counted (i.e. word "x" occurs n times); rows are documents, columns are words
cv = CountVectorizer()
X_tr = cv.fit_transform(X_train.words)
X_te = cv.transform(X_test.words)
model = MultinomialNB()
model.fit(X_tr, y_train)
y_pred = model.predict(X_te)
print(metrics.accuracy_score(y_test, y_pred))
# accuracy = accuracy_score(y_test, y_pred) * 100
# print(accuracy)
As I understand it, the following occurs:
A dataframe, df, is created and split into X and Y (words and classes).
The data is split into training and testing groups.
The count vectorizer, cv, assigns an index to each word and counts how many times each word occurs in a given class (word occurrences as numbers).
A multinomial NB model is created and fit with the training data (X_train.words is used so that the 'words' column header is ignored).
The model is tested with the testing data and an accuracy score is printed.
I've already tried:
Checking the shapes of the X_train and X_test matrices: they match as I think they should, with an equal number of columns (words) and a 6:3 ratio of rows (samples, per the train/test split).
Checking the variable types: the training and testing X's are both sparse matrices (<class 'scipy.sparse.csr.csr_matrix'>) and the training/testing y's are, per the parameters of model.fit, array-like with shape (n_samples,) (pandas Series).
The issue is that the accuracy is 0.0, meaning something is wrong. Perhaps the greater issue is that I have no idea what.
The problem is that your whole data frame contains just 9 rows, so your model cannot learn anything. I also checked your dataset, and I don't think you can build a sentence classifier from it, as it contains no sentences.
I am currently trying to run logistic regression on some vectors, using the sklearn library.
Here is my code. I first load the files that contain the data and then assign the values to arrays.
import numpy as np
import kaldiio
from sklearn.linear_model import LogisticRegression

# load files
xvectors_train = kaldiio.load_scp('train/xvector.scp')

# create empty arrays in which to store the data
x_train = np.empty(shape=(len(xvectors_train.keys()), len(xvectors_train[list(xvectors_train.keys())[0]])))
y_train = np.empty(len(xvectors_train.keys()), dtype=object)

# assign values to the empty arrays
i = 0
for file_id in xvectors_train:
    x_train[i] = xvectors_train[file_id]
    label = file_id.split('_')
    y_train[i] = label[0]
    i += 1

# create a model and train it
model = LogisticRegression(max_iter=200, solver='liblinear')
model.fit(x_train, y_train)

# predict
model.predict(x_train)

# score
score = model.score(x_train, y_train)
For some reason, even if I use the x_train data for my predictions, the score is about 0.32. Shouldn't it be 1.0, because the model already knows the answers for those? If I use my test data, the score is still about 0.32.
Does anyone know what the problem could be?
There isn't any obvious problem, and the result looks normal: your test score is very similar to your training score.
Most models try to learn rules/parameters that generalize to new data, not to memorize the existing training data, which means "Shouldn't it be 1.0, because the model already knows the answers for those?" is not true...
If you were actually seeing a test score significantly lower than your training score (e.g., 0.32 vs. 1.0), it would mean your model is badly overfitting and needs to be fixed.
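To check whether a model underfits or overfits, compare the training and test scores directly. A minimal sketch, assuming the x_train and y_train arrays from the question; the held-out split here is hypothetical:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# hold out 20% of the data; the question scores on the training data itself
x_tr, x_te, y_tr, y_te = train_test_split(x_train, y_train, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=200, solver='liblinear')
model.fit(x_tr, y_tr)

train_score = model.score(x_tr, y_tr)  # accuracy on data the model has seen
test_score = model.score(x_te, y_te)   # accuracy on unseen data
print(train_score, test_score)

# both scores low and similar  -> underfitting (model or features too weak)
# train score >> test score    -> overfitting (model memorized the training data)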
I have been following this tutorial
https://stackabuse.com/python-for-nlp-sentiment-analysis-with-scikit-learn/
to create a sentiment analyser in Python. However, here's what I don't understand: it seems to me that the data they use is already labeled. So how do I use the training I did on the labeled data to then apply it to unlabeled data?
I want to do something like this:
Assuming I have 2 dataframes:
df1 is a small one with labeled data; df2 is a big one with unlabeled data. I just finished training on df1. How do I then go about predicting the values for df2?
I thought it would be as straightforward as text_classifier.predict(df2.iloc[:,1].values), but that doesn't work for me.
Also, forgive me if this question may seem stupid, but I don't have a lot of experience with machine learning and nltk ...
EDIT:
Here is the code I'm working on:
import pandas as pd
from nltk.corpus import stopwords
from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

enc = preprocessing.LabelEncoder()
# chat_data = chat_data[:180]
# chat_labels = chat_labels[:180]
chat_labels = enc.fit_transform(chat_labels)
vectorizer = TfidfVectorizer(max_features=2500, min_df=1, max_df=1, stop_words=stopwords.words('english'))
features = vectorizer.fit_transform(chat_data).toarray()
print(chat_data)
X_train, X_test, y_train, y_test = train_test_split(features, chat_labels, test_size=0.2, random_state=0)
text_classifier = RandomForestClassifier(n_estimators=200, random_state=0)
text_classifier.fit(X_train, y_train)
predictions = text_classifier.predict(X_test)
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
print(accuracy_score(y_test, predictions))

chatData = pd.read_csv(r"C:\Users\jgott\OneDrive\Dokumente\Thesis\chat.csv")
unlabeled = chatData.iloc[:, 1].values
unlabeled = vectorizer.fit_transform(unlabeled.astype('U'))
print(unlabeled)
# features = vectorizer.fit_transform(unlabeled).toarray()
predictions = text_classifier.predict(unlabeled)
Most of it is taken exactly from the tutorial, except for the line with astype in it, which I used to convert the unlabeled data because I got a ValueError telling me it can't convert from string to float otherwise.
how do I use the training I did on the labeled data to then apply to unlabeled data?
This is really the problem that supervised ML tries to solve: given known labeled data as inputs of the form (sample, label), a model tries to discover the generic patterns that exist in these data. These patterns will hopefully be useful for predicting the labels of unseen, unlabeled data.
For example, in a (sad, happy) sentiment-analysis problem, the model might discover patterns like these during training:
The presence of one or more of these words suggests sad:
("misery" , 'sad', 'displaced people', 'homeless'...)
The presence of one or more of these words suggests happy:
("win" , "delightful", "wedding", ...)
If a new text document is given, we search for these patterns in it and label it accordingly.
As a side note: we usually do not use the whole labeled dataset for training; instead, we hold out a small portion (separate from the training set) to validate the model and verify that it has discovered truly generic patterns, not ones tailored specifically to the training data.
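Applied to the code in the question, the mechanical step is to push the unlabeled text through the already-fitted vectorizer with transform(), not fit_transform(), which would rebuild the vocabulary and no longer match what text_classifier was trained on. A minimal sketch, reusing the vectorizer, enc, text_classifier, and chatData names from the question, and assuming vectorizer is still the one fitted on chat_data:

# transform() keeps the vocabulary/columns the classifier was trained on
unlabeled_text = chatData.iloc[:, 1].values
unlabeled_features = vectorizer.transform(unlabeled_text.astype('U')).toarray()

# one predicted label per row of unlabeled data
predictions = text_classifier.predict(unlabeled_features)
print(enc.inverse_transform(predictions))  # map encoded labels back to the original names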
I'm working on a bag-of-words problem with a dataset that has two columns: summary and solution. I'm using KNN for it. The train dataset has 91 columns and the test dataset has 15 columns.
To generate the vectors, I'm using the following piece of code.
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
train_bow_set = vectorizer.fit_transform(dataset[0]).todense()
print(train_bow_set)
print(vectorizer.vocabulary_)
I trained it:
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(train_bow_set, dataset[1])
Now, I'm testing it.
y_pred = classifier.predict(test_bow_set)
Here is the error I get when I test it:
sklearn/neighbors/binary_tree.pxi in sklearn.neighbors.kd_tree.BinaryTree.query()
ValueError: query data dimension must match training data dimension
I guess you are fitting the vectorizer on the test data again instead of using the transform function. Make sure you do the following:
test_bow_set = vectorizer.transform(test_dataset)
You are fitting the vectorizer again:
train_bow_set = vectorizer.fit_transform(dataset[0]).todense()
You need to keep the vectorizer from the training stage (all the preprocessing elements, actually) and only use transform afterwards. Fitting again will profoundly change the results.
train_bow_set = vectorizer.transform(dataset[0]).todense()
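Putting both answers together: fit the vectorizer once on the training documents, then only transform the test documents so they land in the same feature space. A minimal sketch, assuming dataset[0] and dataset[1] hold the training texts and labels and test_dataset holds the test texts, as in the question:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier

vectorizer = CountVectorizer()

# fit once on the training texts; this fixes the vocabulary (the column layout)
train_bow_set = vectorizer.fit_transform(dataset[0])

# transform only: the test matrix gets exactly the same columns
test_bow_set = vectorizer.transform(test_dataset)

classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(train_bow_set, dataset[1])
y_pred = classifier.predict(test_bow_set)  # dimensions now match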
I am trying to predict a cluster for a bunch of test documents in a trained k-means model using scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(train_documents)

k = 10
model = KMeans(n_clusters=k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)
The model is generated without any problem, with 10 clusters. But when I try to predict on a list of documents, I get an error.
predicted_cluster = model.predict(test_documents)
Error message:
ValueError: could not convert string to float...
Do I need to use PCA to reduce the number of features, or do I need to preprocess the text documents first?
You need to transform the test_documents the same way the training documents were transformed:
X_test = vectorizer.transform(test_documents)
predicted_cluster = model.predict(X_test)
Make sure you only call transform on the test documents, and use the same vectorizer object that was used for fit() or fit_transform() on the training documents.
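As a design note, this whole class of mistakes (refitting or skipping the vectorizer at predict time) can be avoided by bundling the vectorizer and the model into a single sklearn Pipeline. This is a sketch of that alternative, not code from the question:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# the pipeline applies the fitted vectorizer automatically before predicting
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('kmeans', KMeans(n_clusters=10, init='k-means++', max_iter=100, n_init=1)),
])

pipeline.fit(train_documents)                         # fits the vectorizer, then KMeans
predicted_cluster = pipeline.predict(test_documents)  # raw strings are fine here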