How to output Prediction Values into an Excel File? - python

I am new to scikit-learn. I want to take the prediction values, convert them back to text, and output them to an Excel file.
The project takes a row of strings and predicts which category each row belongs to (there are approximately 5 categories).
Description | Actual Answer | Prediction
Some string that is random in length per row | Car | Truck
I want the Excel file to look something like the table above. I do not want to output the numerical prediction results; I want to output the actual text itself.
Can anyone help me on how to do this?
This is my code so far:
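import pandas as pd
from sklearn.model_selection import train_test_split
# Assumed Keras imports (not shown in the original; the module path varies by Keras version)
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
X = df['without_Tags']
Y = df['Tower']
tokens = Tokenizer()
tokens.fit_on_texts(X.values)  # build the vocabulary first; otherwise texts_to_sequences returns empty sequences
VectorX = tokens.texts_to_sequences(df['without_Tags'].values)
VectorX = pad_sequences(VectorX, maxlen=200)
VectorY = pd.get_dummies(df['Tower'])
X_train, X_test, y_train, y_test = train_test_split(VectorX, VectorY, test_size=0.20, random_state=0)
# Model Creation
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)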

You can do this:
import pandas as pd
CSV = pd.DataFrame({
"Prediction": y_pred
})
CSV.to_csv("prediction.csv", index=False)
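The file will be named "prediction.csv" and will be saved in the current working directory.
Update: since y_pred has one column per category (the model was fitted on one-hot targets), give each column a name: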
import pandas as pd
csv = pd.DataFrame(y_pred, columns=["1st", "2nd", "3rd", "4th", "5th"])
csv.to_csv("pred.csv", index=False)
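To get the text labels rather than the 0/1 columns, one option is to map each one-hot row back to its column name. The following is a minimal sketch, not the asker's actual pipeline: the random integer features stand in for the padded token sequences, and the label values "Car", "Truck" and "Bike" are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical stand-in data; replace with your own vectorized text and labels.
rng = np.random.default_rng(0)
features = rng.integers(0, 50, size=(40, 10))
labels = pd.Series(rng.choice(["Car", "Truck", "Bike"], size=40), name="Tower")

# Same one-hot step as in the question (dtype=int keeps the columns as 0/1).
one_hot = pd.get_dummies(labels, dtype=int)
X_train, X_test, y_train, y_test = train_test_split(
    features, one_hot, test_size=0.2, random_state=0
)

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
y_pred = np.asarray(knn.predict(X_test))

# The position of the largest value in each row is the index of the category name.
class_names = one_hot.columns.to_numpy()
result = pd.DataFrame({
    "Actual Answer": class_names[y_test.to_numpy().argmax(axis=1)],
    "Prediction": class_names[y_pred.argmax(axis=1)],
})

# To write an Excel file (requires an engine such as openpyxl to be installed):
# result.to_excel("predictions.xlsx", index=False)
```

Note that a one-hot prediction row can be all zeros when no category wins a majority among the neighbors, in which case argmax falls back to the first column. A simpler route that avoids this is to fit the classifier on the text labels directly (df['Tower'] instead of the get_dummies output); scikit-learn classifiers accept string labels, and predict then returns the text itself.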

Related

KNN Classifier Python

I am currently using the scikit learn module in order to help with a crime prediction problem. I am having an issue batch coding the entire Dataframe that I have with the knn.predict method.
How can I batch code the entire two columns of my Dataframe with the knn.predict() method in order to store in another Dataframe the output?
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
knn_df = pd.read_csv("/Users/helenapunset/Desktop/knn_dataframe.csv")
# x is the set of features
x = knn_df[['latitude', 'longitude']]
# y is the target variable
y = knn_df['Class']
# train and test data
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 5)
# training the data
knn.fit(x_train,y_train)
# test score was approximately 69%
knn.score(x_test,y_test)
# this is predicted to be a safe zone
crime_prediction = knn.predict([[25.787882, -80.358427]])
print(crime_prediction)
In the last line of the code I was able to pass the two features I am using, latitude and longitude, from my Dataframe labeled knn_df. But this is a single point. I have been searching the documentation for a way to run this knn prediction over the entire Dataframe and cannot find one. Is it possible to use a for loop for this?
Let the new set to be predicted be 'knn_df_predict'. Assuming it has the same column names, try the following lines of code:
x_new = knn_df_predict[['latitude', 'longitude']]  # select the feature columns
crime_prediction = knn.predict(x_new)  # predict for every row of the new set
knn_df_predict['prediction'] = crime_prediction  # add the predictions to the dataframe
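No for loop is needed, because predict accepts a whole 2-D table of rows at once. Here is a small self-contained sketch of the same idea; the coordinates and the "safe"/"unsafe" class values are randomly generated placeholders, not the asker's data.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical data with the same column names as in the question.
rng = np.random.default_rng(0)
knn_df = pd.DataFrame({
    "latitude": rng.uniform(25.7, 25.9, size=30),
    "longitude": rng.uniform(-80.4, -80.2, size=30),
    "Class": rng.choice(["safe", "unsafe"], size=30),
})

feature_cols = ["latitude", "longitude"]
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(knn_df[feature_cols], knn_df["Class"])

# One call predicts every row; the result has one entry per input row.
predictions = knn_df[["Class"]].copy()
predictions["prediction"] = knn.predict(knn_df[feature_cols])
```

The predictions DataFrame then holds the known class next to the predicted one for each row.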

How can I do classification with a float data type after normalization?

I am working with a data set labeled Adult and I am trying to run a KNN on a few of the columns I have made into a new data Frame and normalized a couple of the columns. I am getting a ValueError: Unknown label type: 'continuous' error when trying to run
clf = KNeighborsClassifier()
clf.fit(X_train, y_train)
After researching the error online, it seems that I need to use a label encoder on my data after I have normalized it, because it is now a float rather than an int, but I am having trouble using the label encoder. The code I am using is:
import numpy as np ##Import necessary packages
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import *
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
url2="http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data" #Reading in Data from a freely and easily available source on the internet
Adult = pd.read_csv(url2, header=None, skipinitialspace=True) #Decoding data by removing extra spaces in columns with skipinitialspace=True
##Assigning reasonable column names to the dataframe
Adult.columns = ["age","workclass","fnlwgt","education","educationnum","maritalstatus","occupation",
"relationship","race","sex","capitalgain","capitalloss","hoursperweek","nativecountry",
"less50kmoreeq50kn"]
Adult.loc[Adult.loc[:, "race"] == "Amer-Indian-Eskimo", "race"] = "Other" #consolidating categorical data in the race column
Adult.loc[:,"race"].value_counts().plot(kind='bar') #plotting the consolidated categorical data in the race column
plt.title('race after consolidation')
plt.show()
Adult.loc[:, "White"] = (Adult.loc[:, "race"] == "White").astype(int) #One-hot encoding the categorical data in the race column
Adult.loc[:, "Black"] = (Adult.loc[:, "race"] == "Black").astype(int)
Adult.loc[:, "Asian-Pac-Islander"] = (Adult.loc[:, "race"] == "Asian-Pac-Islander").astype(int)
Adult.loc[:, "Other"] = (Adult.loc[:, "race"] == "Other").astype(int)
Adult.loc[:,"Other"] #Verifying One-hot encoding for Other column
Adult = Adult.drop("race", axis=1) #removing the obsolete column "race"
Minage = min(Adult.loc[:,"age"]) #MinMax normalizing the age column
Maxage = max(Adult.loc[:,"age"])
MinMaxage = (Adult.loc[:,"age"] - Minage)/(Maxage - Minage)
Minhours = min(Adult.loc[:,"hoursperweek"]) #MinMax normalizing the hoursperweek column
Maxhours = max(Adult.loc[:,"hoursperweek"])
MinMaxhours = (Adult.loc[:,"hoursperweek"] - Minhours)/(Maxhours - Minhours)
df2 = pd.DataFrame() #creating a dataframe to plot the normalized data
df2.loc[:,0] = Adult.loc[:, "White"] #filling the data frame
df2.loc[:,1] = MinMaxage
df2.loc[:,2] = MinMaxhours
df2.columns = ["White","MinMaxage","MinMaxhours"]
X = np.array(df2.drop(['MinMaxhours'], axis=1))
y = np.array(df2['MinMaxhours'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
clf = KNeighborsClassifier()
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(accuracy)
clf.predict(X_test)
y_test
Could someone help me out with how to label encode the data so I can perform Knn on the data? I have looked it up on the sklearn site and different examples, but am still having trouble using it on my dataset. I receive the error message when trying to fit the data running clf.fit(X_train, y_train)
It looks like you have a regression problem instead of a classification problem. You are trying to predict the MinMaxhours variable, which is a real number. If you are trying to predict a real number, you should use the regression version of the nearest neighbor algorithm. The following code should work to get a prediction.
from sklearn.neighbors import KNeighborsRegressor
clf = KNeighborsRegressor()
clf.fit(X_train, y_train)
y_test_pred = clf.predict(X_test)
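The difference can be reproduced on made-up data. In this sketch (random features and a random continuous target, both placeholders for the Adult columns), the classifier rejects the float target with a ValueError, while the regressor fits and predicts.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Hypothetical data: two features and a continuous target in [0, 1),
# similar in kind to a min-max normalized column.
rng = np.random.default_rng(0)
X = rng.random((50, 2))
y = rng.random(50)

# A classifier expects discrete class labels, so a float target is rejected.
try:
    KNeighborsClassifier().fit(X, y)
    classifier_error = None
except ValueError as exc:
    classifier_error = str(exc)

# The regressor averages the targets of the nearest neighbors instead.
reg = KNeighborsRegressor(n_neighbors=5).fit(X, y)
y_pred = reg.predict(X[:5])
```

A label encoder would not be the right fix here: it would turn every distinct float into its own class, which is rarely what is wanted for a continuous quantity.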

Do predictions using training data

I have 2 CSV files: one is a training dataset and the other is a test dataset. The training dataset contains 36 columns, one of which is the outcome, with values A-F. The test dataset has 35 columns and does not have the outcome. I want to add an outcome column to the test dataset as well. I searched several tutorials but did not find a method to follow. Can anyone describe the process I should follow?
You haven't supplied any sample data or said which technique you want to use, so the code below shows how to make predictions in general:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
# Assuming you have already read the 2 csv files into train and test
X_train = train.loc[:, train.columns != 'Outcome'] # <---- Here you exclude the Outcome from train
y_train = train['Outcome'] # <---- This is your Outcome
le = LabelEncoder()
y_train = le.fit_transform(y_train) # <---- Here you convert your A-F values to numeric (0-5)
# I am assuming the rest of the x variables are numeric.
rf = RandomForestClassifier() # <---- Here you create the classifier
rf.fit(X_train, y_train) # <---- Fit the classifier on train data
rf.score(X_train, y_train) # <---- Check accuracy on the training data (optimistic; not a held-out score)
y_pred = pd.DataFrame(le.inverse_transform(rf.predict(test)), columns = ['Outcome']) # <---- Predict on test data and convert 0-5 back to A-F
test_pred = pd.concat([test, y_pred['Outcome']], axis = 1) # <---- Here you add the predictions column to your test dataset
test_pred.to_excel(r'path\Test.xlsx')
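The encode, predict, decode round trip can be tried on made-up data. In this sketch the feature columns f1 to f3 and the random outcomes are hypothetical stand-ins for the 35 real feature columns; only the 'Outcome' column name comes from the answer above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

# Hypothetical train/test frames; the real ones would come from pd.read_csv.
rng = np.random.default_rng(0)
train = pd.DataFrame(rng.random((60, 3)), columns=["f1", "f2", "f3"])
train["Outcome"] = rng.choice(list("ABCDEF"), size=60)
test = pd.DataFrame(rng.random((10, 3)), columns=["f1", "f2", "f3"])

# Encode the letter outcomes as integers for training.
le = LabelEncoder()
y_train = le.fit_transform(train["Outcome"])

rf = RandomForestClassifier(random_state=0)
rf.fit(train.drop(columns="Outcome"), y_train)

# Decode the integer predictions back to letters before attaching them.
test_pred = test.copy()
test_pred["Outcome"] = le.inverse_transform(rf.predict(test))
```

The encoder is optional for the target: scikit-learn classifiers also accept string labels directly, in which case predict returns the letters without any decoding step.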
That depends on how you will find or calculate the outcome you need to add.
One way would be to load the test dataset as a Pandas data frame, calculate the outcome, and collect the values in a list which you then add to your Pandas dataframe:
import pandas as pd
data = pd.DataFrame(columns=['Names', 'Age', 'Outcome'])
names = ['John', 'Nicole', 'Evan']
age = [53, 23, 27]
data['Names'] = names
data['Age'] = age
outcome = [6545, 5252, 85665]
data['Outcome'] = outcome

ValueError: could not convert string to float: 'Pregnant'

I am solving a decision tree classification problem. The code is below:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn import metrics #Import scikit-learn metrics module for accuracy calculation
col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
# load dataset
pima = pd.read_csv("diabetes.csv", header=None, names=col_names)
#split dataset in features and target variable
feature_cols = ['pregnant', 'insulin', 'bmi', 'age','glucose','bp','pedigree']
X = pima[feature_cols] # Features
y = pima.label # Target variable
# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) # 70% training and 30% test
# Create Decision Tree classifier object
clf = DecisionTreeClassifier()
# Train Decision Tree Classifier
clf = clf.fit(X_train,y_train)
#Predict the response for test dataset
y_pred = clf.predict(X_test)
And a preview of the dataset: [image: dataset preview, not reproduced]
I am getting an error
ValueError: could not convert string to float: 'Pregnant'
Please help me solve this error.
Change this line to read the data with headers from csv file:
From:
pima = pd.read_csv("diabetes.csv", header=None, names=col_names)
to
pima = pd.read_csv("diabetes.csv") # This will import the data file with the header names from the csv, which you can change later if required.
Or manually remove the top row using this code:
pima = pima.iloc[1:]
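The first non-header line of your dataset contains what looks to be a duplicate header line. This is likely the CSV's own header row, which pandas reads as data because header=None says the file has no header. Thus the first value of X is "Pregnant" and not a float as you require.
You could either filter out the non-float values or fix your dataset.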
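The effect of the header argument can be seen on a tiny in-memory CSV. This is a sketch with three made-up columns rather than the real diabetes.csv; header=0 combined with names is a third option that keeps your own column names while discarding the file's header row.

```python
import io
import pandas as pd

# Hypothetical miniature file whose first line is a header row.
csv_text = "Pregnant,Glucose,Label\n6,148,1\n1,85,0\n"
col_names = ["pregnant", "glucose", "label"]

# header=None: the header line is treated as data, so every column holds strings.
bad = pd.read_csv(io.StringIO(csv_text), header=None, names=col_names)

# header=0 with names: the file's header line is replaced by col_names,
# and the remaining rows are parsed as numbers.
good = pd.read_csv(io.StringIO(csv_text), header=0, names=col_names)
```

Here bad has three rows with "Pregnant" as its first value, while good has two numeric rows.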

Split Random Data Training And Data Testing in Pandas CSV

I'm a newbie with Python.
I would like to split my data into training and testing sets, with 6 rows for training and 2 rows for testing.
I am confused about how to randomly split the data from a CSV file into training and testing sets.
I have tried splitting it, but the rows come out in the same order as in the CSV file.
Here is my code for the training and testing data:
import pandas as pd

def ambilData():
    df = pd.read_csv("datalatihnodummy.csv", sep=';')
    dropdata = df.drop(['data', 'Klasifikasi'], axis=1)
    datalatih = dropdata.iloc[:6]  # first 6 rows as training data
    datauji = dropdata.iloc[6:]    # remaining rows as testing data
    return datalatih, datauji
Here is the output of the training data:
And here is the output of the testing data:
I would like to test Hepatitis B only or Hepatitis A only.
Does anyone know how to randomize my dataset? Thank you.
here is my data: https://drive.google.com/open?id=1tD3h0aS-AB4qrMg2vw0fHcMx6F3jzJCx
Try this:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.15, random_state=1)
This randomly splits your data into training and testing sets.
x_data holds the independent variables, so here you want to drop 'Klasifikasi'.
y_data is the dependent variable; in your df it is 'Klasifikasi'.
Hope it helps.
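For a 6/2 split specifically, test_size can also be given as a row count. Below is a minimal sketch on an invented 8-row table: the 'data' and 'Klasifikasi' column names come from the question, while the "fever" and "nausea" feature columns and all values are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 8-row dataset standing in for datalatihnodummy.csv.
df = pd.DataFrame({
    "data": [f"p{i}" for i in range(8)],
    "fever": [1, 0, 1, 1, 0, 0, 1, 0],
    "nausea": [0, 0, 1, 0, 1, 1, 0, 1],
    "Klasifikasi": ["Hepatitis A", "Hepatitis B"] * 4,
})

x_data = df.drop(["data", "Klasifikasi"], axis=1)  # independent variables
y_data = df["Klasifikasi"]                         # dependent variable

# An integer test_size is a number of rows; rows are shuffled before splitting.
# stratify keeps the class proportions similar in both parts.
x_train, x_test, y_train, y_test = train_test_split(
    x_data, y_data, test_size=2, random_state=1, stratify=y_data
)
```

Changing random_state gives a different shuffle; leaving it out gives a new random split on every run.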
