Inconsistent neural network results when using sklearn and seed - python

After running a neural network in sklearn, I am receiving inconsistent results, even after setting a seed. Each time I run the code, I get different values of MSE and R squared for each tested seed value, and the values range widely, with R squared being anything between -0.1 and 0.6. I'm wondering if it's a data issue, as I only have 22 columns and 241 rows. I've also tried setting
mlp=MLPRegressor(hidden_layer_sizes=(22,22,22),max_iter=2000,learning_rate_init=0.001,random_state=0)
as well as changing the value of the random_state.
Below is my code. Many thanks.
import matplotlib.pyplot as plt
import pandas as pd
import sklearn
import numpy as np
data=pd.read_csv(r'''D:\PhD\1styear\machinelearning\NNforF2050\DATAnnF2050.csv''')
print(data.shape)
print(data.dtypes)
x=data.drop('EnergyConsumpManuf',axis=1)
y=data['EnergyConsumpManuf']
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2)
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
scaler.fit(x_train)
x_train=scaler.transform(x_train)
x_test=scaler.transform(x_test)
from sklearn.neural_network import MLPRegressor
from sklearn import metrics
from sklearn.metrics import accuracy_score
from math import sqrt
for i in range(15):
    print('np.random.seed(%d)' % (i))
    np.random.seed(i)
    mlp = MLPRegressor(hidden_layer_sizes=(22, 22, 22), max_iter=2000, learning_rate_init=0.001)
    mlp.fit(x_train, y_train)
    predictions = mlp.predict(x_test)
    print('MSE test: ', metrics.mean_squared_error(y_test, predictions))
    RMS = sqrt(metrics.mean_squared_error(y_test, predictions))
    print('RMS', RMS)
    RTWO = sklearn.metrics.r2_score(y_test, predictions)
    print('RTWO', RTWO)
    print('MAE', metrics.mean_absolute_error(y_test, predictions))

You need to set the random_state parameter of the train_test_split function as well. Without a fixed random state, the data is split differently each time, which is why the results change every time you run the code.
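As a minimal sketch (reusing the variable names from the question; the seed values themselves are arbitrary, and the scaling step is omitted for brevity), fixing both sources of randomness could look like this:
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# Fix the split: the same rows land in train/test on every run.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Fix the weight initialisation and shuffling inside the network as well.
mlp = MLPRegressor(hidden_layer_sizes=(22, 22, 22), max_iter=2000,
                   learning_rate_init=0.001, random_state=0)
mlp.fit(x_train, y_train)
predictions = mlp.predict(x_test)
With both random_state values pinned, repeated runs should produce identical MSE and R squared; varying only the MLPRegressor seed then isolates how sensitive the network itself is to initialisation.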

Related

How to make knn faster?

I have a dataset of shape (700000, 20) and I want to apply KNN to it.
However, prediction takes a really long time; can someone please help me understand how I can reduce the KNN prediction time?
Is there something like GPU-KNN or something similar? Please let me know.
Below is the code I am using.
import os
os.chdir(os.path.dirname(os.path.realpath(__file__)))
import tensorflow as tf
import pandas as pd
import numpy as np
from joblib import load, dump
from scipy.spatial import distance
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from dtaidistance import dtw
window_length = 20
n = 5
X_train = load('X_train.pth').reshape(-1,20)
y_train = load('y_train.pth').reshape(-1)
X_test = load('X_test.pth').reshape(-1,20)
y_test = load('y_test.pth').reshape(-1)
#custom metric
def DTW(a, b):
    return dtw.distance(a, b)
clf = KNeighborsClassifier(metric=DTW)
clf.fit(X_train, y_train)
#evaluate
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))
I can suggest reducing the number of features, which from your dataset shape I think is 20, meaning you have 20 dimensions.
You can reduce the number of features by using PCA (Principal Component Analysis) like the following:
from sklearn.decomposition import PCA
train_data_pca = PCA(n_components=10)
reduced_train_data = train_data_pca.fit_transform(train_data)
This code will reduce the dimensionality, for example to 10 components instead of 20.
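Applied to the variables from the question, a minimal sketch could look like this (10 components is an arbitrary choice; note that the PCA is fitted on the training data only and then reused to transform the test data):
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

pca = PCA(n_components=10)
X_train_reduced = pca.fit_transform(X_train)   # fit on training data only
X_test_reduced = pca.transform(X_test)         # reuse the same projection

clf = KNeighborsClassifier(n_neighbors=n, metric=DTW)
clf.fit(X_train_reduced, y_train)
Fewer dimensions means each pairwise distance computation has less work to do.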
The second issue I see in your code is that you are not passing the k value (n_neighbors) to the classifier. It should be as follows:
clf = KNeighborsClassifier(n_neighbors=n, metric=DTW)
The DTW metric is what takes so much time; simple KNN with the default metric works fine.
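If DTW itself is the bottleneck, one further thing worth trying (a sketch, not a guaranteed speed-up) is letting scikit-learn spread the distance computations over all CPU cores with the n_jobs parameter:
from sklearn.neighbors import KNeighborsClassifier

# n_jobs=-1 uses all available cores for the pairwise distance computations
clf = KNeighborsClassifier(n_neighbors=n, metric=DTW, n_jobs=-1)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
A custom Python callable metric is still evaluated pair by pair in Python, so parallelism helps but the per-call overhead remains.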

How to calculate the accuracy?

I'm trying to calculate the accuracy for a Twitter sentiment analysis project. However, I get the error below, and I was wondering if anyone could help me calculate the accuracy. Thanks!
Error: ValueError: Classification metrics can't handle a mix of continuous and multiclass targets
My code:
import re
import pickle
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
df = pd.read_csv("updated_tweet_info.csv")
data = df.fillna(' ')
train,test = train_test_split(data, test_size = 0.2, random_state = 42)
train_clean_tweet = []
for tweet in train['tweet']:
    train_clean_tweet.append(tweet)
test_clean_tweet = []
for tweet in test['tweet']:
    test_clean_tweet.append(tweet)
v = CountVectorizer(analyzer = "word")
train_features= v.fit_transform(train_clean_tweet)
test_features=v.transform(test_clean_tweet)
lr = RandomForestRegressor(n_estimators=200)
fit1 = lr.fit(train_features, train['clean_polarity'])
pred = fit1.predict(test_features)
accuracy = accuracy_score(pred, test['clean_polarity'])
You are trying to use accuracy_score, but accuracy is a classification metric, and your RandomForestRegressor produces continuous predictions.
In your case, try a regression metric such as mean_squared_error() and then apply np.sqrt() to it. This gives you the Root Mean Squared Error (RMSE); the lower the number, the better. The scikit-learn documentation on regression metrics has more details.
Try this:
from sklearn.metrics import mean_squared_error
import numpy as np
rmse = np.sqrt(mean_squared_error(test['clean_polarity'], pred))
This guy also had the same problem
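If what you actually want is classification accuracy, a different route is to train a classifier rather than a regressor. A minimal sketch, assuming clean_polarity holds discrete class labels (e.g. negative/neutral/positive) and reusing the vectorised features from the question:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

clf = RandomForestClassifier(n_estimators=200)
clf.fit(train_features, train['clean_polarity'])
pred_labels = clf.predict(test_features)       # discrete labels, not continuous scores
print(accuracy_score(test['clean_polarity'], pred_labels))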

How to calculate r-squared with python?

I have fitted a model from which I'd like to know the scores (r-squared).
The data is split into a training and testing set. Although the model is only trained using the training set, how is it possible that my r-squared for my testing data is higher? I mean the model has never seen the testing set, but is more accurate than with the training set... Am I interpreting something wrong?
My code:
import pandas as pd
import numpy
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy
import sklearn
from sklearn.linear_model import LinearRegression
from scipy import stats
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
df = pd.read_csv("https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/module_5_auto.csv")
df=df._get_numeric_data()
y_data = df['price']
x_data=df.drop('price',axis=1)
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.15, random_state=1)
lr=LinearRegression()
lr.fit(x_train[['horsepower']], y_train)
h=lr.score(x_train[['horsepower']], y_train).mean()
h2=lr.score(x_test[['horsepower']], y_test).mean()
print(h,h2)
To put it simply, R-squared measures how much of the variation in the observed values is explained by the predicted values; it is commonly used as an accuracy score when comparing two series of numbers, such as predictions against actuals.
Formula: R² = 1 - (SS_res / SS_tot), where SS_res = Σ(y - ŷ)² is the residual sum of squares and SS_tot = Σ(y - ȳ)² is the total sum of squares.
Note: squaring Pearson's r, squaring pandas corr(), or r**2 gives slightly different results than the R² formula shown above; this is attributed to statistical rounding. Refer to Max Pierini's answer:
SciKit Learn R-squared is very different from square of Pearson's Correlation R
Method 1: function
def r_squared(y, y_hat):
    y_bar = y.mean()                      # mean of the observed values
    ss_tot = ((y - y_bar) ** 2).sum()     # total sum of squares
    ss_res = ((y - y_hat) ** 2).sum()     # residual sum of squares
    return 1 - (ss_res / ss_tot)
Method 2: sklearn library
from sklearn.metrics import r2_score
r2 = r2_score(actual, predicted)
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html
It looks like you're using scikit-learn. If so, you can use the r2_score metric.
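For example, applied to the train/test split from the question (a sketch using the question's own variable names), the test-set R² can be computed from the predictions directly:
from sklearn.metrics import r2_score

y_test_pred = lr.predict(x_test[['horsepower']])
r2_test = r2_score(y_test, y_test_pred)   # same value as lr.score(x_test[['horsepower']], y_test)
print(r2_test)
LinearRegression.score already returns R², so r2_score is simply the explicit form of the same computation; a higher test score than train score on a small split like this usually just reflects which rows happened to land in the 15% test set.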

ValueError: Found input variables with inconsistent numbers of samples: [4, 25001] Seems like header does not get recognized?

Does anyone know what the problem is? It seems like the column names are not recognized as a header. Below is my code, with the corresponding errors in bold. I want to write a function that trains a logistic regression by splitting the dataset into train and test sets (70% of the data for training and 30% for testing). Thank you in advance.
Import
import numpy as np
import pandas as pd
import csv
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn import tree
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.datasets import data
from sklearn.model_selection import train_test_split
Load the dataset, separate the data into columns and give column names:
colnames=["Watermark", "Micro-print", "Ultraviolet fields", "Magnetic fields", "Diameter","Target"]
Dataset=pd.read_csv("/Users/David/Documents/Python Assignment2/data-banknote.csv", sep=',', names=colnames)
Dataset.index=np.arange(1,len(Dataset)+1)
Define TrainData and TestData
TrainData= Dataset["Watermark"],Dataset["Micro-print"],Dataset["Ultraviolet fields"],Dataset["Magnetic fields"],Dataset["Diameter"]
TestData= Dataset["Target"]
Show head of the dataset
TrainData.head()
TestData.head()
An error is given that TrainData has no header?
Splitting the dataset
TrainData_train,TrainData_test,TestData_train,TestData_test = train_test_split(TrainData,TestData,test_size=0.3,random_state=0)
ValueError: Found input variables with inconsistent numbers of samples: [4, 25001]?
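As a hedged sketch of what is going on (assuming data-banknote.csv really contains those feature columns plus Target): TrainData is built as a tuple of separate Series, so train_test_split sees a handful of objects on one side and 25001 rows on the other, hence the "inconsistent numbers of samples" error. Selecting the features as a single DataFrame keeps the row counts aligned:
feature_cols = ["Watermark", "Micro-print", "Ultraviolet fields", "Magnetic fields", "Diameter"]
X = Dataset[feature_cols]   # one DataFrame: 25001 rows, 5 columns
y = Dataset["Target"]       # one Series with 25001 values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression()
model.fit(X_train, y_train)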

gplearn library for generating new lines of data from given dataset

I am using the gplearn library (genetic programming) for generating new rules from a given dataset. Currently I have 11 rows of data with 24 columns (features) that I give as input to the SymbolicRegressor method to get new rules. However, I am only getting one rule. With crossover, shouldn't I get 11 new rules if I give 11 rows of data as input? If I am doing it wrong, what is the right way of doing it?
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import ExtraTreesRegressor
from gplearn.genetic import SymbolicRegressor
data = pd.read_csv("D:/Subjects/Thesis/snort_rules/ransomware_dataset.csv")
x_train = data.iloc[:,0:23]
y_train = data.iloc[:,:-1]
gp = SymbolicRegressor(population_size=11,
                       generations=2, stopping_criteria=0.01,
                       p_crossover=0.8, p_subtree_mutation=0.1,
                       p_hoist_mutation=0.05, p_point_mutation=0.05,
                       max_samples=0.9, verbose=1,
                       parsimony_coefficient=0.01, random_state=0)
gp.fit(x_train, y_train)
print(gp._program)
The output is :
X7/(X15*(-X16*X20 - X19 + X2))
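SymbolicRegressor evolves a whole population of candidate programs (11 here, set by population_size, not one per data row), but the fitted estimator's _program attribute only exposes the single best program, which is why only one expression is printed. As a sketch using internal, underscore-prefixed attributes that may change between gplearn versions, the rest of the final generation can be listed like this:
# _programs holds one list of candidate programs per generation;
# the last entry is the final generation's population of 11 programs.
for program in gp._programs[-1]:
    print(program)
The number of rows only affects how each candidate's fitness is evaluated; the number of candidate expressions is controlled by population_size and generations.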
