I'm trying to use a basic Naive-Bayes Classifier in Python using VSC. My attempts all yield 0.0 accuracy.
This is sample data: A CSV without header, of format
class,"['item1','item2','etc']"
The goal is to fit this data to a Multinomial NB model. This is my attempt at it:
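import pandas
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
df = pandas.read_csv('file.csv', delimiter=',',names=['class','words'],encoding='utf-8')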
#x is independent var/feature
X = df.drop('class',axis=1)
#y is dependent var/label
Y = df['class']
#split data into train/test splits, use 25% of data for testing
X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size=0.25,random_state = 42)
#create a sparse matrix of words; each word is assigned a number and frequency is counted (i.e. word "x" occurs n amount of times in class Z), rows are classes, columns are words
cv = CountVectorizer()
X_tr = cv.fit_transform(X_train.words)
X_te = cv.transform(X_test.words)
model = MultinomialNB()
model.fit(X_tr,y_train)
y_pred = model.predict(X_te)
print(metrics.accuracy_score(y_test, y_pred))
# accuracy = accuracy_score(y_test,y_pred)*100
# print(accuracy)
As I understand it the following occurs:
A dataframe, df, is created, and split into X and Y (words and classes)
The data's collectively split into training/testing groups
The count vectorizer, cv, assigns an index to each word and counts how many times a certain word occurs in a certain class (word occurrences as numbers)
A Multinomial model is created and fit with the training data (X_train.words is used so that the "words" label is ignored)
The model is tested with the testing data and an accuracy score is printed.
I've already tried:
Checking the shape of the x_test and x_train dataframe: they match like I think they should, with an equal amount of columns (words), and a 6:3 ratio of rows (classes, per the train test split)
Checking the variable types: the training and testing x's are all sparse matrices (<class 'scipy.sparse.csr.csr_matrix'>) and the testing/training y's are, per the parameters of model.fit, array-like shapes of n samples (pandas series).
The issue is that the accuracy is 0.0, meaning something's wrong. Perhaps the greater issue is that I have no idea what.
The problem is that your whole data frame has only 9 rows, so your model has almost nothing to learn from. Also, I checked your dataset, and I don't think you can build a sentence classifier from it, as there are no sentences in it.
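To illustrate how so few rows can produce exactly 0.0, here is a minimal sketch. It assumes (the question does not confirm this) that each of the 9 rows has its own class, and the rows themselves are made up for the example. Under that assumption every class in the test split is missing from the training split, so no prediction can match.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical data: 9 rows, each with a unique class (an assumption).
df = pd.DataFrame({
    "class": [f"class{i}" for i in range(9)],
    "words": [f"['item{i}a','item{i}b']" for i in range(9)],
})

X_train, X_test, y_train, y_test = train_test_split(
    df["words"], df["class"], test_size=0.25, random_state=42
)

cv = CountVectorizer()
X_tr = cv.fit_transform(X_train)
X_te = cv.transform(X_test)

model = MultinomialNB().fit(X_tr, y_train)
y_pred = model.predict(X_te)

# Classes in the test split that never appear in the training split.
unseen = set(y_test) - set(y_train)
accuracy = accuracy_score(y_test, y_pred)
print(len(X_train), len(X_test), sorted(unseen), accuracy)
```

Comparing set(y_test) with set(y_train) like this is a quick check on the real data: if the test labels are not covered by the training labels, the score says nothing about the model.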
For various reasons, I have base dataframes of the following structure
print(df1.shape)
display(df1.head())
print(df2.shape)
display(df2.head())
Where the top dataframe is my features set and my bottom is the output set. To turn this into a problem that is amenable to data modeling I first do:
x_train, x_test, y_train, y_test = train_test_split(df1, df2, train_size = 0.8)
I then have a split for 80% training and 20% testing.
Since the output set (df2; y_train/y_test) consists of individual measurements with no inherent meaning on their own, I compute pairwise distances between the labels, which gives a single output value per pair of observations. The distances are computed after z-scoring (that code is not shown here, but it is done):
y_train = pdist(y_train, 'euclidean')
y_test = pdist(y_test, 'euclidean')
Similarly, I apply the same strategy to the feature set, generating pairwise distances between observations separately for each feature.
def feature_distances(input_vector):
    modified_vector = np.array(input_vector).reshape(-1,1)
    vector_distances = pdist(modified_vector, 'euclidean')
    vector_distances = pd.Series(vector_distances)
    return vector_distances
x_train = x_train.apply(feature_distances, axis = 0)
x_test = x_test.apply(feature_distances, axis = 0)
I then proceed to train & test all of my models.
For now I am trying linear regression, random forest, and xgboost.
Is there any easy way to implement a cross validation scheme in my dataset?
Since my problem requires calculating pairwise distances between observations, I am struggling to identify an easy way to do cross validation schemes to optimize parameter tuning.
GridsearchCV doesn't quite work here since in each instance of the test/train split, distances have to be recomputed to avoid contamination of test with train.
Hope it's clear!
First, from the shapes of your data frames I understand that you have 42 samples and 1643 features in the input, and that each output vector consists of 392 values.
Huge input: if you are sure that your problem has 1643 features, you might need to use PCA to reduce the dimensionality instead of pairwise distances. You should also collect more than 42 samples to avoid overfitting, because that is not enough data to train and test your model.
Huge output: you could use sampled_softmax_loss to speed up training, as mentioned in the TensorFlow documentation. You could also read this here. If you do not want to follow this approach, you can continue training with this output, but it will take some time.
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.8, random_state=n)
Here X holds the independent features and y is the dependent feature, i.e. what you actually want to predict; it can be a label or a continuous value. We apply train_test_split to the training dataset, use (x_train, y_train) to train the model, and use (x_test, y_test) to check the model's performance on unseen data. In your case you passed df2 as y, which is wrong: figure out your target feature and pass it as y. There is no need to split the test data.
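On the cross-validation part of the question: the requirement that distances be recomputed inside every split can be sketched as a manual K-fold loop. This is only an illustration under stated assumptions: the helper names (pairwise_features, pairwise_target, cross_validate_pairwise) are hypothetical, the data is synthetic, and the z-scoring step mentioned in the question is left out.

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

def pairwise_features(X):
    # One column of pairwise distances per original feature.
    return np.column_stack(
        [pdist(X[:, [j]], "euclidean") for j in range(X.shape[1])]
    )

def pairwise_target(Y):
    # One distance per pair of observations, over all output columns.
    return pdist(Y, "euclidean")

def cross_validate_pairwise(X, Y, make_model, n_splits=5, seed=0):
    scores = []
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in folds.split(X):
        # Distances are computed inside the fold, so no pair mixes
        # a training observation with a test observation.
        model = make_model()
        model.fit(pairwise_features(X[train_idx]), pairwise_target(Y[train_idx]))
        scores.append(
            model.score(pairwise_features(X[test_idx]), pairwise_target(Y[test_idx]))
        )
    return scores

# Synthetic stand-in data (hypothetical shapes, not the poster's).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 6))
Y = rng.normal(size=(40, 3))
scores = cross_validate_pairwise(X, Y, LinearRegression, n_splits=5)
print(scores)
```

For parameter tuning, one could call cross_validate_pairwise once per candidate parameter set (passing a different make_model each time) and compare the mean scores.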
I have spent 30 hours on this single problem de-bugging and it makes absolutely no sense, hopefully one of you guys can show me a different perspective.
The problem is that I train a random forest on my training dataframe and get very good accuracy (98%-99%), but when I load a new sample to predict on, the model ALWAYS guesses the same class.
# Shuffle the data-frames records. The labels are still attached
df = df.sample(frac=1).reset_index(drop=True)
# Extract the labels and then remove them from the data
y = list(df['label'])
X = df.drop(['label'], axis='columns')
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=TEST_SIZE)
# Construct the model
model = RandomForestClassifier(n_estimators=N_ESTIMATORS, max_depth=MAX_DEPTH, random_state=RANDOM_STATE,oob_score=True)
# Calculate the training accuracy
in_sample_accuracy = model.fit(X_train, y_train).score(X_train, y_train)
# Calculate the testing accuracy
test_accuracy = model.score(X_test, y_test)
print()
print('In Sample Accuracy: {:.2f}%'.format(model.oob_score_ * 100))
print('Test Accuracy: {:.2f}%'.format(test_accuracy * 100))
The way I am processing the data is the same, but when I predict on the X_test or X_train I get my normal 98% and when I predict on my new data it always guesses the same class.
# The json file is not in the correct format, this function normalizes it
normalized_json = json_normalizer(json_file, "", training=False)
# Turn the json into a list of dictionaries which contain the features
features_dict = create_dict(normalized_json, label=None)
# Convert the dictionaries into pandas dataframes
df = pd.DataFrame.from_records(features_dict)
print('Total amount of email samples: ', len(df))
print()
df = df.fillna(-1)
# One hot encodes string values
df = one_hot_encode(df, noOverride=True)
if 'label' in df.columns:
    df = df.drop(['label'], axis='columns')
print(list(model.predict(df))[:100])
print(list(model.predict(X_train))[:100])
Above is my testing scenario. In the last two lines I predict on X_train, the data used to train the model, and on df, the out-of-sample data, for which it always guesses class 0.
Some useful information:
The datasets are imbalanced; class 0 has about 150,000 samples while class 1 has about 600,000 samples
There are 141 features
changing the n_estimators and max_depth doesn't fix it
Any ideas would be helpful. If you need more information, let me know; my brain is fried right now and that's all I could think of.
Fixed. The issue was the imbalance of the datasets. I also realized that changing the depth gave me different results.
For example, 10 trees with 3 depth -> seemed to work fine
10 trees with 6 depth -> back to guessing only the same class
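For readers hitting the same symptom, one common way to account for class imbalance in scikit-learn is the class_weight="balanced" option. The sketch below is not necessarily what the poster did; it uses synthetic data with roughly the 20/80 class ratio described above, and simply compares how often each class is predicted with how often it actually occurs.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: roughly 20% class 0, 80% class 1.
X, y = make_classification(
    n_samples=2000, n_features=20, weights=[0.2, 0.8], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = RandomForestClassifier(
    n_estimators=10, max_depth=3, class_weight="balanced", random_state=0
)
model.fit(X_train, y_train)

# How often each class is predicted, next to the true counts.
pred_classes, pred_counts = np.unique(model.predict(X_test), return_counts=True)
true_classes, true_counts = np.unique(y_test, return_counts=True)
print(dict(zip(pred_classes, pred_counts)))
print(dict(zip(true_classes, true_counts)))
```

If the predicted counts collapse onto a single class while the true counts do not, that points to the kind of problem described in this thread.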
I have 15 features with a binary response variable, and I am interested in predicting probabilities rather than 0 or 1 class labels. When I trained and tested the RF model with 500 trees, CV, balanced class weight, and balanced samples in the data frame, I achieved good accuracy and a good Brier score. As you can see in the image, the predicted probabilities of class 1 on the test data range from 0 to 1.
Here is the Histogram of predicted probabilities on test data:
with the majority of values at 0 - 0.2 and 0.9 to 1, which is quite accurate.
But when I predict the probabilities for unseen data, i.e. all data points for which the 0 or 1 value is unknown, the predicted probabilities for class 1 lie only between 0 and 0.5. Why is that? Shouldn't the values be from 0.5 to 1?
Here is the histogram of predicted probabilities on unseen data:
I am using sklearn RandomforestClassifier in python. The code is below:
#Read the CSV
df=pd.read_csv('path/df_all.csv')
#Change the type of the variable as needed
df=df.astype({'probabilities': 'int32', 'CPZ_CI_new.tif' : 'category'})
#Response variable is between 0 and 1 having actual probabilities values
y = df['probabilities']
# Separate majority and minority classes
df_majority = df[y == 0]
df_minority = df[y == 1]
# Upsample minority class
df_minority_upsampled = resample(df_minority,
                                 replace=True, # sample with replacement
                                 n_samples=100387, # to match majority class
                                 random_state=42) # reproducible results
# Combine majority class with upsampled minority class
df1 = pd.concat([df_majority, df_minority_upsampled])
y = df1['probabilities']
X = df1.iloc[:,1:138]
#Change integer values to category
y_01=y.astype('category')
#Split training and testing
X_train, X_valid, y_train, y_valid = train_test_split(X, y_01, test_size = 0.30, random_state = 42,stratify=y)
#Model
model=RandomForestClassifier(n_estimators = 500,
                             max_features= 'sqrt',
                             n_jobs = -1,
                             oob_score = True,
                             bootstrap = True,
                             random_state=0,class_weight='balanced',)
#I had 137 variable, to select the optimum one, I used RFECV
rfecv = RFECV(model, step=1, min_features_to_select=1, cv=10, scoring='neg_brier_score')
rfecv.fit(X_train, y_train)
#Retrained the model with only 15 variables selected
rf=RandomForestClassifier(n_estimators = 500,
                          max_features= 'sqrt',
                          n_jobs = -1,
                          oob_score = True,
                          bootstrap = True,
                          random_state=0,class_weight='balanced',)
#X1_train is the same dataframe but with only the 15 selected variables
rf.fit(X1_train,y_train)
#Printed ROC metric
print('roc_auc_score_testing:', metrics.roc_auc_score(y_valid,rf.predict(X1_valid)))
#Predicted probabilties on test data
predv=rf.predict_proba(X1_valid)
predv = predv[:, 1]
print('brier_score_training:', metrics.brier_score_loss(y_train, predt))
print('brier_score_testing:', metrics.brier_score_loss(y_valid, predv))
#Output is,
roc_auc_score_testing: 0.9832652130944419
brier_score_training: 0.002380976369884945
brier_score_testing: 0.01669848089917487
#Later, I built a data frame (sample_img) from images of those 15 variables and used the same function to predict probabilities.
IMG_pred=rf.predict_proba(sample_img)
IMG_pred=IMG_pred[:,1]
The results shown for your test data are not valid; you perform a mistaken procedure that has two serious consequences, which invalidate them.
The mistake here is that you perform the minority class upsampling before splitting to train & test sets, which should not be the case; you should first split into training and test sets, and then perform the upsampling only to the training data and not to the test ones.
The first reason why such a procedure is invalid is that, this way, some of the duplicates created by upsampling will end up in both the training and the test splits; as a result, the algorithm is tested with some samples that it has already seen during training, which violates the most fundamental requirement of a test set. For more details, see own answer in Process for oversampling data for imbalanced binary classification; quoting from there:
I once witnessed a case where the modeller was struggling to understand why he was getting a ~ 100% test accuracy, much higher than his training one; turned out his initial dataset was full of duplicates -no class imbalance here, but the idea is similar- and several of these duplicates naturally ended up in his test set after the split, without of course being new or unseen data...
The second reason is that this procedure shows biased performance measures in a test set that is no longer representative of reality: remember, we want our test set to be representative of the real unseen data, which of course will be imbalanced; artificially balancing our test set and claiming that it has X% accuracy when a great part of this accuracy will be due to the artificially upsampled minority class makes no sense, and gives misleading impressions. For details, see own answer in Balance classes in cross validation (the rationale is identical for the case of train-test split, as here).
The second reason is why your procedure would still be wrong even if you had not made the first mistake, and had instead upsampled the training and test sets separately after splitting.
In short, you should remedy the procedure so that you first split into training & test sets, and then upsample your training set only.
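A minimal sketch of the corrected order (split first, then upsample the training portion only) could look like the following. The data frame and its column names are synthetic stand-ins, not the poster's data, and the 30% test size simply mirrors the question.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced frame (hypothetical columns): 900 zeros, 100 ones.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "feature_a": rng.normal(size=1000),
    "feature_b": rng.normal(size=1000),
    "label": [0] * 900 + [1] * 100,
})

# 1) Split first, so the test set keeps the real class ratio.
train_df, test_df = train_test_split(
    df, test_size=0.30, random_state=42, stratify=df["label"]
)

# 2) Upsample the minority class inside the training set only.
train_majority = train_df[train_df["label"] == 0]
train_minority = train_df[train_df["label"] == 1]
train_minority_upsampled = resample(
    train_minority,
    replace=True,                      # sample with replacement
    n_samples=len(train_majority),     # match the training majority count
    random_state=42,
)
train_balanced = pd.concat([train_majority, train_minority_upsampled])

X_train = train_balanced.drop(columns="label")
y_train = train_balanced["label"]
X_test = test_df.drop(columns="label")
y_test = test_df["label"]

print(y_train.value_counts().to_dict())
print(y_test.value_counts().to_dict())
```

With this order, no upsampled duplicate can reach the test set, and the test set stays imbalanced like the real unseen data.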
I converted two columns of a pandas dataframe into numpy arrays to use as the features and labels for a machine learning problem.
Code:
train_index, test_index = next(iter(ShuffleSplit(len(labels), train_size=0.2, test_size=0.80, random_state=42)))
features_train, features_test = X[train_index], X[test_index]
labels_train, labels_test = labels[train_index], labels[test_index]
clf = DecisionTreeClassifier()
clf.fit(features_train, labels_train)
pred = clf.predict(features)
print(pred)
Features is currently an array of frequency counts (I used a CountVectorizer earlier to fit and transform my original pandas dataframe column). I have the full list of labels stored as pred, but I would like the corresponding feature to each label, so that I may return the list of labels to my pandas dataframe.
The ordering of the predictions is the same as that of the passed data (and as @Ulf pointed out, you are using the term "feature" incorrectly here: a feature is a column of your matrix, the particular object that you are counting with CountVectorizer; rows are observations, samples, data points, and this is what you currently call features). Thus, to see sample-label pairs, you can simply zip them together:
pred = clf.predict(features)
for sample, label in zip(features, pred):
    print(sample, label)
If you actually want to recover what each column means, your CountVectorizer is your guy. Somewhere in your code you created it
vectorizer = CountVectorizer( ... )
and later used it
... = vectorizer.fit_transform( ... )
now you can use it to transform your samples back through
pred = clf.predict(features)
for sample, label in zip(features, pred):
    print(vectorizer.inverse_transform(np.array([sample])), label)
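As a self-contained illustration of pairing each prediction with the terms of its row, here is a small sketch. The documents and labels are made up, and it calls inverse_transform once on the whole matrix instead of row by row, which is a variation on the snippet above and not the original code.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

# Made-up documents and labels, for illustration only.
docs = ["red apple", "green apple", "fast car", "slow car"]
labels = ["fruit", "fruit", "vehicle", "vehicle"]

vectorizer = CountVectorizer()
features = vectorizer.fit_transform(docs)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(features, labels)
pred = clf.predict(features)

# inverse_transform returns, for each row, the terms with nonzero counts.
terms_per_row = vectorizer.inverse_transform(features)
for terms, label in zip(terms_per_row, pred):
    print(sorted(terms), label)
```

Because the row order is preserved, pred can also be assigned straight back to the dataframe the rows came from, as a new column.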
I am trying to make a sentiment analyser using the scikit-learn LinearSVC classifier. The problem is that the classifier is classifying every sentence as positive. Another question: why does predict() return a list of classified labels for every text? I thought it should return only one text/number, which is the classified label. Here is a sample cut from the code.
vectorizer = TfidfVectorizer(input='content', decode_error='ignore')
vect_train_x = vectorizer.fit_transform(training_data) # this is actually a list of sentences
scaler = StandardScaler(with_mean=False) # I don't know why it should be False
X_train = scaler.fit_transform(vect_train_x) # compute mean, std and transform training data as well
vect_test_x = vectorizer.transform(test) # the sentence that needs to be classified
X_test = scaler.transform(vect_test_x)
clf = LinearSVC()
clf.fit(X_train, labels)
print(vect_test_x)
print(clf.predict(X_test)) # returning me a list of Positive => ['Positive' 'Positive' 'Positive' 'Positive' 'Positive' 'Positive']
I would be very grateful if you could explain what exactly I am not understanding. I tried to read the documentation, but without any examples I could not understand it. My training data consists of 100 000 positive and 100 000 negative sentences.
I came across this and had the same issue. What solved my problem was to convert X_test to a list first and then to an np.array, which then needs to be passed to the 'predict' function.
new_array = []
new_array.append(Input) #Input is string if reading from a file or from input()
X_test = np.array(new_array)
print(clf.predict(X_test))
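The idea of wrapping a single sentence in a list can be sketched end to end as below. The training sentences are made up, and, unlike the snippet above, the wrapped sentence goes through the fitted vectorizer before predict. predict returns one label per input row, so a result with six labels may indicate that the test input was treated as six separate documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Tiny made-up training set, for illustration only.
training_data = ["great movie", "loved it", "terrible movie", "hated it"]
labels = ["Positive", "Positive", "Negative", "Negative"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(training_data)

clf = LinearSVC()
clf.fit(X_train, labels)

sentence = "loved this great movie"
# Wrap the single sentence in a list: transform expects an iterable of documents.
X_test = vectorizer.transform([sentence])
pred = clf.predict(X_test)
print(X_test.shape[0], pred)
```

Here X_test has exactly one row, so pred holds exactly one label; pred[0] gives it as a plain value.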