After running a neural network in scikit-learn, I am receiving inconsistent results even after setting the seed: each time I run the code, I get different MSE and R-squared values for every tested seed, with R-squared ranging anywhere from -0.1 to 0.6. I'm wondering whether it's a data issue, as I only have 22 columns and 241 rows. I've also tried setting
mlp=MLPRegressor(hidden_layer_sizes=(22,22,22),max_iter=2000,learning_rate_init=0.001,random_state=0)
as well as changing the value of random_state.
Below is my code. Many thanks.
import matplotlib.pyplot as plt
import pandas as pd
import sklearn
import numpy as np
data=pd.read_csv(r'''D:\PhD\1styear\machinelearning\NNforF2050\DATAnnF2050.csv''')
print(data.shape)
print(data.dtypes)
x=data.drop('EnergyConsumpManuf',axis=1)
y=data['EnergyConsumpManuf']
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2)
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
scaler.fit(x_train)
x_train=scaler.transform(x_train)
x_test=scaler.transform(x_test)
from sklearn.neural_network import MLPRegressor
from sklearn import metrics
from sklearn.metrics import accuracy_score
from math import sqrt
for i in range(15):
    print('np.random.seed(%d)' % (i))
    np.random.seed(i)
    mlp = MLPRegressor(hidden_layer_sizes=(22,22,22), max_iter=2000, learning_rate_init=0.001)
    mlp.fit(x_train, y_train)
    predictions = mlp.predict(x_test)
    print('MSE test: ', metrics.mean_squared_error(y_test, predictions))  # evaluated on the test set
    RMS = sqrt(metrics.mean_squared_error(y_test, predictions))
    print('RMS', RMS)
    RTWO = sklearn.metrics.r2_score(y_test, predictions)
    print('RTWO', RTWO)
    print('MAE', metrics.mean_absolute_error(y_test, predictions))
You need to set the random_state parameter of train_test_split as well. Without a fixed random state, the data is split differently on each run, which is why the results change every time you run the code.
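A minimal sketch of the fix, reusing the variable names from the question (the scaling steps stay unchanged):

from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# Fix the split so every run sees the identical 80/20 partition
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Fix the model so weight initialisation and batch shuffling are reproducible too
mlp = MLPRegressor(hidden_layer_sizes=(22, 22, 22), max_iter=2000,
                   learning_rate_init=0.001, random_state=0)
mlp.fit(x_train, y_train)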
I have a dataset of shape (700000, 20) and I want to apply KNN to it.
However, prediction on the test set takes a very long time; can someone please help me reduce the KNN prediction time?
Is there something like GPU-KNN? Please let me know.
Below is the code I am using.
import os
os.chdir(os.path.dirname(os.path.realpath(__file__)))
import tensorflow as tf
import pandas as pd
import numpy as np
from joblib import load, dump
from scipy.spatial import distance
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from dtaidistance import dtw
window_length = 20
n = 5
X_train = load('X_train.pth').reshape(-1,20)
y_train = load('y_train.pth').reshape(-1)
X_test = load('X_test.pth').reshape(-1,20)
y_test = load('y_test.pth').reshape(-1)
#custom metric
def DTW(a, b):
    return dtw.distance(a, b)
clf = KNeighborsClassifier(metric=DTW)
clf.fit(X_train, y_train)
#evaluate
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))
I can suggest reducing the number of features, which from your dataset shape appears to be 20, meaning you have 20 dimensions.
You can reduce the number of features by using PCA (Principal Component Analysis), like the following:
from sklearn.decomposition import PCA

pca = PCA(n_components=10)
X_train_reduced = pca.fit_transform(X_train)
X_test_reduced = pca.transform(X_test)  # apply the same fitted PCA to the test set

This will reduce the dimensionality from 20 down to, for example, 10 components.
The second issue in your code is that you are not passing the k (number of neighbours) value to the classifier. It should be as follows:
clf = KNeighborsClassifier(n_neighbors=n, metric=DTW)
The DTW metric is taking too much time, while simple KNN (with the default metric) works fine.
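If the DTW computation itself is the bottleneck, two things may help (a sketch, worth benchmarking on your own data): dtaidistance also provides a C implementation, dtw.distance_fast, which expects contiguous float64 arrays, and KNeighborsClassifier accepts n_jobs to parallelise the brute-force distance computations across cores.

import numpy as np
from dtaidistance import dtw
from sklearn.neighbors import KNeighborsClassifier

def DTW_fast(a, b):
    # distance_fast needs C-contiguous arrays of doubles
    return dtw.distance_fast(np.ascontiguousarray(a, dtype=np.float64),
                             np.ascontiguousarray(b, dtype=np.float64))

clf = KNeighborsClassifier(n_neighbors=n, metric=DTW_fast, n_jobs=-1)
clf.fit(X_train, y_train)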
I'm trying to calculate the accuracy for a Twitter sentiment analysis project. However, I get the error below, and I was wondering if anyone could help me calculate the accuracy? Thanks.
Error: ValueError: Classification metrics can't handle a mix of continuous and multiclass targets
My code:
import re
import pickle
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
df = pd.read_csv("updated_tweet_info.csv")
data = df.fillna(' ')
train,test = train_test_split(data, test_size = 0.2, random_state = 42)
train_clean_tweet = []
for tweet in train['tweet']:
    train_clean_tweet.append(tweet)
test_clean_tweet = []
for tweet in test['tweet']:
    test_clean_tweet.append(tweet)
v = CountVectorizer(analyzer = "word")
train_features= v.fit_transform(train_clean_tweet)
test_features=v.transform(test_clean_tweet)
lr = RandomForestRegressor(n_estimators=200)
fit1 = lr.fit(train_features, train['clean_polarity'])
pred = fit1.predict(test_features)
accuracy = accuracy_score(pred, test['clean_polarity'])
You are trying to use the accuracy_score method, but accuracy is a classification metric.
In your case, since RandomForestRegressor is a regression model, use a regression metric instead, e.g. mean_squared_error() followed by np.sqrt(). This will give you the Root Mean Squared Error; the lower the number, the better. See the scikit-learn metrics documentation for more details.
Try this:
import numpy as np
from sklearn.metrics import mean_squared_error  # this import was missing from the snippet

rmse = np.sqrt(mean_squared_error(test['clean_polarity'], pred))
Someone else had the same problem as well.
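Alternatively, if clean_polarity actually contains discrete class labels (e.g. 0/1/2 sentiment classes) rather than continuous scores, the direct fix is to train a classifier instead of a regressor and keep accuracy_score — a sketch under that assumption:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

clf = RandomForestClassifier(n_estimators=200)
clf.fit(train_features, train['clean_polarity'])
pred = clf.predict(test_features)
accuracy = accuracy_score(test['clean_polarity'], pred)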
I have fitted a model for which I'd like to know the scores (R-squared).
The data is split into a training and a testing set. Although the model is trained only on the training set, how is it possible that the R-squared for my testing data is higher? The model has never seen the testing set, yet it is more accurate on it than on the training set... Am I interpreting something wrong?
My code:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy
import sklearn
from sklearn.linear_model import LinearRegression
from scipy import stats
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
df=pd.read_csv("https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-
data/CognitiveClass/DA0101EN/module_5_auto.csv")
df=df._get_numeric_data()
y_data = df['price']
x_data=df.drop('price',axis=1)
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.15, random_state=1)
lr=LinearRegression()
lr.fit(x_train[['horsepower']], y_train)
h = lr.score(x_train[['horsepower']], y_train)   # R² on the training set (score returns a single float, so .mean() is unnecessary)
h2 = lr.score(x_test[['horsepower']], y_test)    # R² on the test set
print(h,h2)
To put it simply, R-squared measures how much of the variance in the observed values is explained by the predictions, i.e. it is a goodness-of-fit score between two series of numbers.
Formula: R² = 1 - SS_res/SS_tot, where SS_res = Σ(y - ŷ)² and SS_tot = Σ(y - ȳ)².
Note: squaring Pearson's r (or pandas corr()) does not in general give exactly the same value as the R² formula above; Max Pierini's answer explains the difference:
SciKit Learn R-squared is very different from square of Pearson's Correlation R
Method 1: function

def r_squared(y, y_hat):
    y_bar = y.mean()
    ss_tot = ((y - y_bar) ** 2).sum()
    ss_res = ((y - y_hat) ** 2).sum()
    return 1 - (ss_res / ss_tot)
Method 2: sklearn library
from sklearn.metrics import r2_score
r2 = r2_score(actual, predicted)
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html
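A quick sanity check with made-up numbers showing the two methods agree:

import numpy as np
from sklearn.metrics import r2_score

y = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.8, 5.3, 7.1, 8.6])
print(r_squared(y, y_hat))  # 0.985 (Method 1)
print(r2_score(y, y_hat))   # 0.985 (Method 2, identical)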
It looks like you're using scikit-learn. If so, you can use the r2_score metric.
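As for why the test R² can come out higher than the train R²: with test_size=0.15 on a small dataset, the held-out sample is tiny, so its score is noisy and can land above the training score purely by chance. Since cross_val_score is already imported in the question, a more stable estimate is (a sketch reusing the question's variables):

from sklearn.model_selection import cross_val_score

scores = cross_val_score(lr, x_data[['horsepower']], y_data, cv=5, scoring='r2')
print(scores.mean(), scores.std())  # average R² and its spread across folds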
Does anyone know what the problem is? It seems like the column names are not recognized as a header. Below is my code, with the corresponding errors in bold. I want to write a function that trains a logistic regression by splitting the dataset into train and test sets (70% of the data for training and 30% for testing). Thank you in advance.
Import
import numpy as np
import pandas as pd
import csv
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn import tree
from sklearn.model_selection import train_test_split
Load the dataset, separate the data into columns and give column names:
colnames=["Watermark", "Micro-print", "Ultraviolet fields", "Magnetic fields", "Diameter","Target"]
Dataset=pd.read_csv("/Users/David/Documents/Python Assignment2/data-banknote.csv", sep=',', names=colnames)
Dataset.index=np.arange(1,len(Dataset)+1)
Define TrainData and TestData
TrainData= Dataset["Watermark"],Dataset["Micro-print"],Dataset["Ultraviolet fields"],Dataset["Magnetic fields"],Dataset["Diameter"]
TestData= Dataset["Target"]
Show head of the dataset
TrainData.head()
TestData.head()
An error is raised saying that TrainData has no header?
Splitting the dataset
TrainData_train,TrainData_test,TestData_train,TestData_test = train_test_split(TrainData,TestData,test_size=0.3,random_state=0)
ValueError: Found input variables with inconsistent numbers of samples: [4, 25001]?
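For what it's worth, both errors point to the same cause (an observation, not a tested fix): the comma-separated assignment makes TrainData a tuple of Series, which has no head() and which train_test_split interprets as a handful of samples rather than a column selection. Selecting the feature columns as one DataFrame would look like this:

feature_cols = ["Watermark", "Micro-print", "Ultraviolet fields", "Magnetic fields", "Diameter"]
TrainData = Dataset[feature_cols]   # a single DataFrame, not a tuple of Series
TestData = Dataset["Target"]
TrainData_train, TrainData_test, TestData_train, TestData_test = train_test_split(
    TrainData, TestData, test_size=0.3, random_state=0)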
I am using the gplearn library (genetic programming) to generate new rules from a given dataset. Currently I have 11 rows of data with 24 columns (features) that I give as input to the SymbolicRegressor method to get new rules. However, I am only getting one rule. With crossover, shouldn't I get 11 new rules if I give 11 rows of data as input? If I am doing it wrong, what is the right way to do it?
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import ExtraTreesRegressor
from gplearn.genetic import SymbolicRegressor
data = pd.read_csv("D:/Subjects/Thesis/snort_rules/ransomware_dataset.csv")
x_train = data.iloc[:, 0:23]
y_train = data.iloc[:, -1]  # assuming the target is the last column; [:, :-1] would select every column except it
gp = SymbolicRegressor(population_size=11,
                       generations=2, stopping_criteria=0.01,
                       p_crossover=0.8, p_subtree_mutation=0.1,
                       p_hoist_mutation=0.05, p_point_mutation=0.05,
                       max_samples=0.9, verbose=1,
                       parsimony_coefficient=0.01, random_state=0)
gp.fit(x_train, y_train)
print(gp._program)
The output is:
X7/(X15*(-X16*X20 - X19 + X2))
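For context, and as an assumption about gplearn's internals worth checking against its documentation: SymbolicRegressor evolves a whole population but exposes only the single fittest individual as _program, so one printed rule is expected regardless of population_size. The final generation's programs should be inspectable along these lines:

# _programs keeps each generation's population; the last entry is the final generation
for program in gp._programs[-1]:
    print(program)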