How can I get the corresponding features back? - python

I converted two columns of a pandas dataframe into numpy arrays to use as the features and labels for a machine learning problem.
Code:
train_index, test_index = next(iter(ShuffleSplit(len(labels), train_size=0.2, test_size=0.80, random_state=42)))
features_train, features_test = X[train_index], X[test_index]
labels_train, labels_test = labels[train_index], labels[test_index]
clf = DecisionTreeClassifier()
clf.fit(features_train, labels_train)
pred = clf.predict(features)
print(pred)
features is currently an array of frequency counts (I used a CountVectorizer earlier to fit and transform my original pandas dataframe column). I have the full list of predicted labels stored as pred, but I would like the corresponding feature for each label, so that I can return the list of labels to my pandas dataframe.

The ordering of predictions is the same as that of the data you passed in. (And, as @Ulf pointed out, you are using the term "feature" incorrectly here: a feature is a column of your matrix, a particular thing you are counting with the CountVectorizer; the rows are observations, samples, data points, and those are what you are currently calling features.) So, to see sample-label pairs, you can simply zip them together:
pred = clf.predict(features)
for sample, label in zip(features, pred):
    print(sample, label)
If you actually want to recover what each column means, your CountVectorizer is your guy. Somewhere in your code you created it:
vectorizer = CountVectorizer( ... )
and later used it:
... = vectorizer.fit_transform( ... )
Now you can use it to transform your samples back:
pred = clf.predict(features)
for sample, label in zip(features, pred):
    print(vectorizer.inverse_transform(np.array([sample])), label)
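As a self-contained illustration, here is a minimal sketch with made-up documents (not the asker's data) showing inverse_transform recovering the words behind each row of counts:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

docs = ["spam spam eggs", "eggs bacon", "spam bacon spam"]
y = np.array([1, 0, 1])

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

clf = DecisionTreeClassifier().fit(X, y)
pred = clf.predict(X)

# inverse_transform maps each row of counts back to the words it contains
for words, label in zip(vectorizer.inverse_transform(X), pred):
    print(words, label)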

Related

KNN Classifier Python

I am currently using the scikit-learn module to help with a crime prediction problem, and I am having trouble batch-predicting over my entire DataFrame with the knn.predict method.
How can I run knn.predict() over the two feature columns of my DataFrame and store the output in another DataFrame?
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
knn_df = pd.read_csv("/Users/helenapunset/Desktop/knn_dataframe.csv")
# x is the set of features
x = knn_df[['latitude', 'longitude']]
# y is the target variable
y = knn_df['Class']
# train and test data
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 5)
# training the data
knn.fit(x_train,y_train)
# test score was approximately 69%
knn.score(x_test,y_test)
# this is predicted to be a safe zone
crime_prediction = knn.predict([[25.787882, -80.358427]])
print(crime_prediction)
In the last line of the code I was able to pass the two features I am using, latitude and longitude, from my DataFrame knn_df. But this is a single point; I have been searching the documentation for a way to streamline this knn prediction for the entire DataFrame and cannot seem to find one. Is it somehow possible to use a for loop for this?
Let the new set to be predicted be called knn_df_predict. Assuming the same column names, try the following lines of code:
x_new = knn_df_predict[['latitude', 'longitude']]  # formatting the features
crime_prediction = knn.predict(x_new)              # predicting for the new set
knn_df_predict['prediction'] = crime_prediction    # adding the prediction to the dataframe
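Put together, a minimal sketch (the file name knn_dataframe_predict.csv is hypothetical, and knn is assumed to be the classifier fitted above):
import pandas as pd

knn_df_predict = pd.read_csv("knn_dataframe_predict.csv")  # hypothetical new data
x_new = knn_df_predict[['latitude', 'longitude']]          # same feature columns as training
knn_df_predict['prediction'] = knn.predict(x_new)          # one label per row
print(knn_df_predict.head())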

Accuracy of 0.0 using Scikit Naive-Bayes model

I'm trying to use a basic Naive Bayes classifier in Python (working in VS Code). My attempts all yield 0.0 accuracy.
This is sample data: a CSV without a header, of the format
class,"['item1','item2','etc']"
The goal is to fit this data to a Multinomial NB model. This is my attempt at it:
import pandas
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

df = pandas.read_csv('file.csv', delimiter=',', names=['class','words'], encoding='utf-8')
#x is independent var/feature
X = df.drop('class',axis=1)
#y is dependent var/label
Y = df['class']
#split data into train/test splits, use 25% of data for testing
X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size=0.25,random_state = 42)
#create a sparse matrix of word counts; each word is assigned a column index and its frequency is counted (i.e. word "x" occurs n times in row z); rows are documents (samples), columns are words
cv = CountVectorizer()
X_tr = cv.fit_transform(X_train.words)
X_te = cv.transform(X_test.words)
model = MultinomialNB()
model.fit(X_tr,y_train)
y_pred = model.predict(X_te)
print(metrics.accuracy_score(y_test, y_pred))
# accuracy = accuracy_score(y_test,y_pred)*100
# print(accuracy)
As I understand it the following occurs:
A dataframe, df, is created, and split into X and Y (words and classes)
The data is split into training and testing groups
The count vectorizer, cv, assigns an index to each word and counts how many times each word occurs in each row (word occurrences as numbers)
A Multinomial NB model is created and fit with the training data (X_train.words is passed so that only the "words" column is vectorized)
The model is tested with the testing data and an accuracy score is printed.
I've already tried:
Checking the shapes of the train and test matrices: they match as I think they should, with an equal number of columns (words) and a 6:3 ratio of rows (samples, per the train/test split)
Checking the variable types: the training and testing X's are sparse matrices (<class 'scipy.sparse.csr.csr_matrix'>) and the training/testing y's are, per the parameters of model.fit, array-likes of shape (n_samples,) (pandas Series).
The Issue is that the accuracy is 0.0, meaning something's wrong. Perhaps the greater issue is that I have no idea what.
The problem is that your whole data frame is only 9 rows long. Just 9 rows. So your model doesn't learn anything. Also, I checked your dataset, and I don't think you can build a sentence classifier from it, as there are no sentences in your dataset.
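Separately, if the words column really does hold stringified Python lists like the sample above, one option is to parse each cell and join the items into a plain space-separated document before vectorizing. This is a sketch that assumes that file format; file.csv and the column names follow the asker's code:
import ast
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.read_csv('file.csv', names=['class', 'words'], encoding='utf-8')

# Turn "['item1','item2']" into the document "item1 item2"
docs = df['words'].apply(lambda s: ' '.join(ast.literal_eval(s)))

cv = CountVectorizer()
X = cv.fit_transform(docs)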

Given feature/column names do not match the ones for the data given during fit. Error

I wrote the following code and it gives me this error:
"Given feature/column names do not match the ones for the data given during fit."
The train and prediction data have the same features.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.multioutput import MultiOutputClassifier
import xgboost as xgb

df_train = data_preprocessing(df_train)
#Split X and Y
X_train = df_train.drop(target_columns,axis=1)
y_train = df_train[target_columns]
#Create a boolean mask for categorical columns
categorical_columns = X_train.columns[X_train.dtypes == 'O'].tolist()
# Create a boolean mask for numerical columns
numerical_columns = X_train.columns[X_train.dtypes != 'O'].tolist()
# Scaling & Encoding objects
numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])
categorical_transformer = OneHotEncoder(handle_unknown='ignore')
col_transformers = ColumnTransformer(
    # name, transformer itself, columns to apply
    transformers=[("scaler_onestep", numeric_transformer, numerical_columns),
                  ("ohe_onestep", categorical_transformer, categorical_columns)])
#Define the model
model = MultiOutputClassifier(
    xgb.XGBClassifier(objective="binary:logistic",
                      colsample_bytree=0.5))
#Define a pipeline
pipeline = Pipeline([("preprocessing", col_transformers), ("XGB", model)])
pipeline.fit(X_train, y_train)
#Data Preprocessing
predicted = data_preprocessing(predicted)
X_predicted = predicted.drop(target_columns,axis=1)
predictions=pipeline.predict(X_predicted)
I get the error during the prediction step. How can I fix this problem? I couldn't find any solution.
Try reordering the columns in X_predicted so that they exactly match X_train.
I am guessing the feature names in the training dataset are not identical to those in the prediction dataset.
For example, if you have 19 features in the training dataset, the prediction dataset must have the same 19 features. The model cannot score data with features it has not seen before.
Adding to the above answers: if you're using a ColumnTransformer like the OP and are unsure what the column names were when the model was fit, you can use pipeline.named_steps['preprocessing']._feature_names_in to figure it out (recent scikit-learn versions expose this publicly as feature_names_in_).
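If the mismatch is just column order, a minimal fix (a sketch assuming both frames contain exactly the same columns) is to reorder the prediction frame against the training columns before predicting:
# Reorder X_predicted to the training column order; raises KeyError if a column is missing
X_predicted = X_predicted[X_train.columns]
predictions = pipeline.predict(X_predicted)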

How do you take a text file and split it into data usable for a machine learning classifier?

For this practice exercise I'm only supposed to use numpy, so I can't just use scikit-learn.
I've loaded the data set and managed to split it into positive and negative arrays. However, I'm not sure what to do now, or even whether I'm processing the data for the classifier correctly.
datasettrain = np.loadtxt("Adaboost-trainer.txt")
negtrain, postrain = np.delete(datasettrain[datasettrain[:,2] < 0],2,1), np.delete(datasettrain[datasettrain[:,2] > 0],2,1)
clf = Adaboost(n_clf=5)
clf.fit(postrain, negtrain)
I know I'm supposed to be inputting features and labels, but surely the data has to be in a different format for that, as opposed to just a plain text file? At least, I've always received data that was already laid out as features and labels, and I could input it just by splitting it. Any thoughts on how someone might process a regular text file into features and labels?
Edit: here is a sample of the file:
1.116574 0.157686 +1
-0.359096 0.653998 -1
1.845620 0.873235 +1
-0.271484 -0.960392 -1
0.304631 2.797998 +1
Ah, if I'm interpreting your sample data correctly, the first two columns are your feature columns and the last column holds your target values. If that's right, then to get training and test sets you would need to do something like the following:
import numpy as np
data = np.loadtxt("Adaboost-trainer.txt")
# Determine your training/test split. I opted for 80/20.
test_size = 0.2
split_index = int(data.shape[0] * test_size)
# Get the full train and test splits: shuffle the row indices,
# then take the first 20% as the test set and the rest as training
indices = np.random.permutation(data.shape[0])
test_idx = indices[:split_index]
train_idx = indices[split_index:]
test = data[test_idx, :]
train = data[train_idx, :]
# Split the X and y for use in models (the label sits in column index 2)
y_train = train[:, -1]
X_train = np.delete(train, 2, axis=1)
y_test = test[:, -1]
X_test = np.delete(test, 2, axis=1)
From there, you have an 80/20 train/test split of your data for use with a model.
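One small design note: np.random.permutation uses NumPy's global random state, so the split changes on every run. Seeding a Generator makes it reproducible; a sketch using the modern Generator API (the seed value is arbitrary):
import numpy as np

rng = np.random.default_rng(42)           # fixed seed -> the same shuffle every run
indices = rng.permutation(data.shape[0])  # drop-in replacement for the line above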

How to make machine learning predictions for empty rows?

I have a dataset that shows whether a person has diabetes based on health indicators. (The original post showed the dataset as an image.)
I've created a straightforward model in order to predict the last column (Outcome).
#Libraries imported
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
#Dataset imported
data = pd.read_csv('diabetes.csv')
#Assign X and y
X = data.iloc[:,:-1].values
y = data.iloc[:,-1].values
#Data preprocessed
sc = StandardScaler()
X = sc.fit_transform(X)
#Dataset split between train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Predicting the results for the whole dataset
y_pred2 = model.predict(data)
#Add prediction column to original dataset
data['prediction'] = y_pred2
However, I get the following error: ValueError: X has 9 features per sample; expecting 8.
My questions are:
Why can't I create a new column with the predictions for my entire dataset?
How can I make predictions for blank outcomes (rows that still need to be predicted)? That is to say, should I upload the file again? Let's say I want to predict the following rows. (The rows to predict were shown as an image: the same feature columns, with Outcome left blank.)
Please let me know if my questions are clear!
You are feeding data (with all 9 initial features) to a model that was trained with X (8 features, since Outcome has been removed to create y), hence the error.
What you need to do is:
Get predictions using X instead of data
Append the predictions to your initial data set
i.e.:
y_pred2 = model.predict(X)
data['prediction'] = y_pred2
Keep in mind that this means your prediction variable will come both from data that have already been used for model fitting (the X_train part) and from data unseen by the model during training (the X_test part). I'm not quite sure what your final objective is (nor is this what the question is about), but this is a rather unusual situation from an ML point of view.
If you have a new dataset data_new for which to predict the outcome, you do it in a similar way; always make sure that X_new has the same features as X (i.e., again removing the Outcome column, as you did for X):
y_new = model.predict(X_new)
data_new['prediction'] = y_new
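One extra caveat specific to this model: X was standardized with a StandardScaler before fitting, so any new data must go through the same fitted scaler before predicting. A sketch, assuming data_new has the same eight feature columns (the slicing below is illustrative):
X_new = data_new.iloc[:, :8].values       # the 8 feature columns, without Outcome
X_new = sc.transform(X_new)               # reuse the scaler fitted on the original data
data_new['prediction'] = model.predict(X_new)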
