I am very new to machine learning, and I would like to get a match percentage back for an individual array that I pass into the prediction model I have created.
I'm not sure how to go about getting that percentage. I thought metrics.accuracy_score(Ytest, y_pred) was the way, but when I try it I get the following error:
**ValueError: Found input variables with inconsistent numbers of samples: [4, 1]**
I have no idea if this is the correct way to go about this.
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # for visualisation
import seaborn as sns # for better visualisation
from bs4 import BeautifulSoup # for text parsing
import mysql.connector
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, GaussianNB
import docx2txt
import re
import csv
from sklearn import metrics

class Machine:
    TrainData = ''

    def __init__(self):
        self.TrainData = self.GetTrain()
        Data = self.ProcessData()
        x = Data[0]
        y = Data[1]
        x, x_test, y, y_test = train_test_split(x, y, stratify=y, test_size=0.25, random_state=42)
        self.Predict(x, y, '', x_test, y_test)

    def Predict(self, X, Y, Data, Xtext, Ytest):
        model = GaussianNB()
        model.fit(Xtext, Ytest)
        y_pred = model.predict([[1.0, 2.00613, 2, 5]])
        print("Accuracy:", metrics.accuracy_score(Ytest, y_pred))

    def ProcessData(self):
        X = []
        Y = []
        i = 0
        for I in self.TrainData:
            Y.append(I[4])
            X.append(I)
            i = i + 1
        i = 0
        for j in X:
            X[i][0] = float(X[i][0])
            X[i][1] = float(X[i][1])
            X[i][2] = int(X[i][2])
            X[i][3] = int(X[i][3])
            del X[i][4]
            i = i + 1
        return X, Y

    def GetTrain(self):
        file = open('docs/training/TI_Training.csv')
        csvreader = csv.reader(file)
        header = next(csvreader)
        rows = []
        for row in csvreader:
            rows.append(row)
        file.close()
        return rows

Machine()
The error is pretty clear: Ytest has 4 samples and y_pred has only one. You need the same number of samples in each to compute any metric. I suspect you instead want to do
y_pred = model.predict(Xtext)
in your Predict function.
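If what you ultimately want is a percentage for a single input row rather than accuracy over the test set, GaussianNB also exposes predict_proba. A minimal sketch, assuming model has already been fitted as above:

probs = model.predict_proba([[1.0, 2.00613, 2, 5]])  # shape (1, n_classes)
confidence = probs.max() * 100  # the model's confidence in its predicted class, as a percentage
print("Prediction:", model.predict([[1.0, 2.00613, 2, 5]])[0], "Confidence:", confidence)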
I have a pretty basic question. My X data is df['input'], Y data is df['label']. This is my code:
from sklearn.feature_extraction.text import TfidfVectorizer
Xfeatures = df['input']
y = df['label']
tfidf_vec = TfidfVectorizer(max_features=MF,
                            max_df=MAXDF)
X = tfidf_vec.fit_transform(Xfeatures)
featurenames = tfidf_vec.get_feature_names()
X.todense()
df_vec = pd.DataFrame(X.todense(),columns=tfidf_vec.get_feature_names())
df_vec.T
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(X, y,test_size = 0.33,random_state = 28)
This is the model that I run for text classification:
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score  # needed for the metrics below
lr_model = SVC()
lr_model.fit(x_train,y_train)
y_pred = lr_model.predict(x_test)
# Accuracy
print(classification_report(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
I would like to identify the inputs that are misclassified (i.e. the rows of df['input']). I can write the predicted and actual categories to a CSV, but not the text that is misclassified (or the training data in general):
import csv
rows = zip(y_test, y_pred)
with open(r"C:\Users\erdem\Desktop\data.csv", "w", newline="") as f:
writer = csv.writer(f)
for row in rows:
writer.writerow(row)
Try going with
X[y_test != y_pred]
y_test != y_pred is a boolean array with True for the misclassified samples (and False for the correct predictions); you can use it to index your X (or Xfeatures).
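If you want the original text back rather than the TF-IDF rows, here is a minimal sketch, assuming y was a pandas Series so that y_test still carries the row index from df:

mask = y_test != y_pred                              # True where the model got it wrong
misclassified = df.loc[y_test[mask].index, 'input']  # the raw input text
pd.DataFrame({'input': misclassified.values,
              'actual': y_test[mask].values,
              'predicted': y_pred[mask]}).to_csv("misclassified.csv", index=False)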
I'm currently trying the following concept:
1. I applied np.log1p() to the independent variables and to the dependent variable (price).
2. Assuming X = independent variables and Y = dependent variable, I train_test_split X and Y.
3. Then I trained the LinearRegression(), Ridge(), Lasso(), and ElasticNet() models.
Given that the labels I used to train the model were also log1p(Y), I'm assuming the model predictions are also log values?
If the predictions are log values, how come np.expm1 doesn't return a value that is on a similar scale?
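As a quick sanity check on the transform itself (my own toy example, separate from the code below), np.expm1 does invert np.log1p exactly, so the inverse transform is not the problem by itself:

import numpy as np
y = np.array([10.0, 100.0, 1000.0])
print(np.expm1(np.log1p(y)))  # [  10.  100. 1000.] -- the round trip is exact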
Linear regression code, for reference:
import os
import glob
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
from scipy.stats import skew
from scipy import stats
from scipy.stats import norm
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV, ShuffleSplit
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
df_num = pd.DataFrame(np.random.randint(0,100,size=(10000, 4)), columns=list('ABCD'))
df_cat = pd.DataFrame(np.random.randint(0,2,size=(10000, 2)), columns=['cat1', 'cat2'])
price = pd.DataFrame(np.random.randint(0,100,size=(10000, 1)), columns=['price'])
y = price
skewness = df_num.apply(lambda x: skew(x))
skewness = skewness[abs(skewness) > 0.5]
skewed_features = skewness.index
df_num[skewed_features] = np.log1p(df_num[skewed_features])
y = np.log1p(y)
train = pd.concat([df_num, df_cat], axis = 1)
X_train, X_test, y_train, y_test = train_test_split(train, y, test_size = 0.3, random_state = 0)
lr_clf = LinearRegression()
lr_clf.fit(X_train, y_train)
def predict_price(A, B, C, D, cat1):
    cat1_index = np.where(train.columns == cat1)[0][0]
    x = np.zeros(len(train.columns))
    x[0] = np.log1p(A)
    x[1] = np.log1p(B)
    x[2] = np.log1p(C)
    x[3] = np.log1p(D)
    if cat1_index >= 0:
        x[cat1_index] = 1
    return np.expm1(lr_clf.predict([x])[0])

predict_price(20, 30, 15, 55, 'cat2')
EDIT 1: I tried to recreate an example from scratch, but I can't seem to replicate the issue I'm running into. The issue I run into with my real data is that:
predictions work totally fine if I DON'T log-normalize the inputs when training and DON'T log-normalize the inputs when predicting;
HOWEVER, when I DO log-normalize when training, log-normalize the inputs when predicting, and np.expm1 the prediction, the value is totally off.
Please let me know if there is anything I can explain more clearly.
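One hedged guess from the code above, not a confirmed diagnosis: training only log-transforms the columns in skewed_features, while predict_price applies np.log1p to all four numeric inputs, so any non-skewed column gets transformed at prediction time but not at training time. A sketch that keeps the two in sync:

def predict_price(A, B, C, D, cat1):
    x = pd.Series(0.0, index=train.columns)
    x[['A', 'B', 'C', 'D']] = [A, B, C, D]
    x[skewed_features] = np.log1p(x[skewed_features])  # same columns as at training time
    x[cat1] = 1
    return np.expm1(lr_clf.predict([x.values])[0])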
I have been writing this code and I have gotten to the point where it runs, but unfortunately it does not converge. Could someone please have a look, because I have checked many things and I'm not sure why it isn't converging. The dataset is from here: https://github.com/nshomron/covidpred/blob/master/data/corona_tested_individuals_ver_006.english.csv.zip
I have split the code up to make it a bit clearer:
#---------- IMPORTS ----------
import numpy as np
import matplotlib as plt
from numpy.core.defchararray import index
import pandas as pd
from pandas.core.tools.datetimes import Scalar
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn import tree
from sklearn import svm
#---------- PREPROCESSING ----------
#---------- Import data ----------
data = pd.read_csv(r'C:\Users\Saaqib\Documents\Python\PythonProjects\Covidproject\corona_tested_individuals.csv', )
X = data.loc[:, data.columns != 'corona_result']
X = X.loc[:, X.columns != 'test_date']
y = data.iloc[:,6]
#---------- Encode data ----------
Le_X = LabelEncoder()
X['age_60_and_above'] = Le_X.fit_transform(X['age_60_and_above'])
X['gender'] = Le_X.fit_transform(X['gender'])
X['test_indication'] = Le_X.fit_transform(X['test_indication'])
# print('data=',X)
y = Le_X.fit_transform(y)
y = np.array(y)
Hot_enc_X = OneHotEncoder()
enc_X = pd.DataFrame(Hot_enc_X.fit_transform(X[['gender','test_indication']]).toarray())
X = X.join(enc_X)
X = X.drop(columns=['gender','test_indication'])
X = X.replace("None", float('nan'))
X["cough"] = X["cough"].fillna(0)
X["fever"] = X["fever"].fillna(0)
X["sore_throat"] = X["sore_throat"].fillna(0)
X["shortness_of_breath"] = X["shortness_of_breath"].fillna(0)
X["head_ache"] = X["head_ache"].fillna(0)
X["age_60_and_above"] = X["age_60_and_above"].fillna(0)
X['cough'] = X['cough'].astype(float)
X['fever'] = X['fever'].astype(float)
X['sore_throat'] = X['sore_throat'].astype(float)
X['shortness_of_breath'] = X['shortness_of_breath'].astype(float)
X['head_ache'] = X['head_ache'].astype(float)
X['age_60_and_above'] = X['age_60_and_above'].astype(float)
#---------- Split data set ----------
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=0)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
#---------- Train Model ----------
covid_model = svm.SVC(kernel='linear')
covid_model.fit(X_train, y_train)
predictions = covid_model.predict(X_test)
acc = accuracy_score(y_test,predictions)
print("pred:", predictions)
print("acc:", acc)
I have typed in the following lines of code:
# import relevant statistical packages
import numpy as np
import pandas as pd
import statsmodels.api as sm
import pylab as pl
import sklearn.linear_model as skl
import sklearn.metrics as metrics
from sklearn.model_selection import train_test_split
# import data
url = "/<...>/Smarket.csv" # relative url within my computer
Smarket = pd.read_csv(url, index_col = 'SlNo')
X3 = Smarket[['Lag1', 'Lag2', 'Lag3', 'Lag4', 'Lag5', 'Volume']]
Y3 = Smarket['Direction']
X_train, X_test, y_train, y_test = train_test_split(X3, Y3, test_size=0.2016)
data_1 = pd.concat([pd.DataFrame(y_train), X_train], axis = 1)
model_1 = sm.formula.glm(formula = 'y_train~X_train', data = data_1, family= sm.families.Binomial()).fit()
X_new = model_1.predict(X_test)
It is in the last line that I receive the following error:
PatsyError: Number of rows mismatch between data argument and X_train (252 versus 998)
y_train~X_train
^^^^^^^
I am unable to understand why I am getting this error. I suspect it is because of the mismatch between the number of rows in X_test and X_train. How do I need to change my code to get the predicted values?
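The formula 'y_train~X_train' makes patsy capture the whole X_train array from the environment, so when predict is called with X_test (252 rows) the design still references X_train (998 rows), hence the mismatch. A hedged sketch of one way around it, assuming Direction takes the values 'Up'/'Down' as in the ISLR Smarket data: put named columns into one DataFrame and reference them in the formula, so predict can rebuild the design from X_test's own columns.

data_1 = X_train.copy()
data_1['Up'] = (y_train == 'Up').astype(int)  # numeric response for the Binomial family
model_1 = sm.formula.glm(formula='Up ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume',
                         data=data_1, family=sm.families.Binomial()).fit()
X_new = model_1.predict(X_test)  # X_test carries the same named columns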
I built a regression model to predict energy (1 column) from 5 variables (5 columns). I used my experimental data to train and fit the model, and it works with a good score.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('new.csv')
X = data.drop(['E'],1)
y = data['E']
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5,
                                                    random_state=2)
from sklearn import ensemble
clf1 = ensemble.GradientBoostingRegressor(n_estimators=400, max_depth=5,
                                          min_samples_split=2, loss='ls',
                                          learning_rate=0.1)
clf1.fit(X_train, y_train)
clf1.score(X_test, y_test)
But now I want to load a new CSV file containing new data for the mentioned 5 variables into an OrderedDict and use the model to predict energy.
With the code below I manually insert one row at a time, and it predicts energy correctly:
from collections import OrderedDict
new_data = OrderedDict([('H', 48.52512), ('A', 169.8379), ('P', 55.52512),
                        ('R', 3.058758), ('Q', 2038.055)])
new_data = pd.Series(new_data)
data = new_data.values.reshape(1, -1)
clf1.predict(data)
But I can't do this row by row for huge datasets and need to import a CSV file. I tried the following but can't figure it out:
data_2 = pd.read_csv('new2.csv')
X_new = OrderedDict(data_2)
new_data = pd.Series(X_new)
data = new_data.values.reshape(1, -1)
clf1.predict(data)
but it gives me: ValueError: setting an array element with a sequence.
Can anyone help me?
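If new2.csv has the same five columns (H, A, P, R, Q) in the same order as the training data, a minimal sketch: skip the OrderedDict/Series step entirely and hand the whole 2-D frame to predict, since reshape(1, -1) was collapsing many rows into a single sample (which causes the ValueError above).

data_2 = pd.read_csv('new2.csv')
predictions = clf1.predict(data_2.values)  # one predicted energy per row
print(predictions)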