This is kNN in R:
library(caret)  # provides knn3Train

data <- read.csv("data.csv", header = TRUE)
x <- data.matrix(data[, 1:47])
y <- as.factor(data[, 48] - 1)

n <- nrow(data)  # n was never defined in the original
training <- sample(1:n, 0.7 * n)
testing <- (1:n)[-training]  # the complement of the training rows

knn <- knn3Train(x[training, ], x[testing, ], y[training], k = 5)
This is kNN in Python:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

data = pd.read_csv("data.csv")
x = data.drop(columns=['class'])
y = data['class'].values

# split the features, not the full frame (the original passed `data`,
# which still contains the label column)
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=500)
knearest = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
knn = knearest.fit(X_train, y_train)
I want the same kNN result in both languages, but I couldn't set exactly the same parameters. I hope my question is clear.
kNN is my first ever step in R and Python; I have zero knowledge otherwise.
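A minimal sketch of one way to rule out the data split as the source of disagreement (the file name train_idx.csv and the write-out step are assumptions, not part of the original question): save the training row indices from R with write.csv(training, "train_idx.csv") and reuse them in Python, so both languages score exactly the same rows.

# Sketch: reuse the exact train/test split produced in R.
# Assumes R wrote its 1-based `training` vector to "train_idx.csv"
# (hypothetical file) via write.csv, which stores it in a column named "x".
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

data = pd.read_csv("data.csv")
x = data.drop(columns=['class'])
y = data['class'].values

train_idx = pd.read_csv("train_idx.csv")['x'].values - 1  # R indices are 1-based
mask = np.zeros(len(data), dtype=bool)
mask[train_idx] = True

# knn3Train uses Euclidean distance, i.e. minkowski with p=2
knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn.fit(x[mask], y[mask])
print(knn.score(x[~mask], y[~mask]))

With identical rows on both sides, any remaining difference comes down to tie-breaking among equidistant neighbors, which the two libraries may resolve differently.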
Related
I am trying to code a multiple linear regression problem using two different methods. One is the simple one as stated below:
from sklearn import linear_model, metrics
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X = df[['geo', 'age', 'v_age']]
y = df['freq']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Fitting model
regr2 = linear_model.LinearRegression()
regr2.fit(X_train, y_train)
ypred = regr2.predict(X_test)  # ypred was used but never defined in the original

print(metrics.mean_squared_error(y_test, ypred))
print(r2_score(y_test, ypred))
The above code gives me an MSE of 0.46 and an R² score of 0.0012, which is a really bad fit. Meanwhile, when I use:
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=1)  # degree=1 should give the same equation as the block above
X_ = poly.fit_transform(X)
y = y.values.reshape(-1, 1)
predict_ = poly.fit_transform(y)  # note: the target is transformed here too

X_train, X_test, y_train, y_test = train_test_split(X_, predict_, test_size=0.33, random_state=42)

# Fitting model
regr2 = linear_model.LinearRegression()
regr2.fit(X_train, y_train)
ypred = regr2.predict(X_test)

print(metrics.mean_squared_error(y_test, ypred))
print(r2_score(y_test, ypred))
Using PolynomialFeatures gives me an MSE of 0.23 and an R² score of 0.5, which is much, much better. I don't understand how two methods using the same regression equation give such different answers; everything else is the same.
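A quick way to see what is going on (a sketch with made-up numbers, not from the original post): PolynomialFeatures(degree=1) only prepends a bias column of ones, so transforming X alone cannot change the fit. The difference comes from also transforming y, which turns the target into two columns, [1, y]; the constant column is predicted with zero error, which would halve the averaged MSE (consistent with 0.46 dropping to 0.23) and pull the averaged R² toward 0.5.

# Sketch with toy data: degree=1 only prepends a column of ones to X.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2., 3.], [4., 5.]])
print(PolynomialFeatures(degree=1).fit_transform(X))
# [[1. 2. 3.]
#  [1. 4. 5.]]

# Applying it to y, as in the second block, yields a two-column target [1, y];
# the scores are then averaged over both columns, not computed on y alone.
y = np.array([[7.], [8.]])
print(PolynomialFeatures(degree=1).fit_transform(y))
# [[1. 7.]
#  [1. 8.]]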
I have a function with a regression loop built in. I want to assign the R² from each iteration to an object that I can print out later.
Here's part of the function (including the regression), trimmed for brevity:
import pandas as pd
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

cuts = [stats, stats_po, stats_ic, stats_id, stats_h, stats_a, stats_bos,
        stats_bkn, stats_nyk, stats_phi, stats_tor, stats_chi, stats_cle,
        stats_det, stats_ind, stats_mil, stats_den, stats_min, stats_okc,
        stats_por, stats_uta, stats_gsw, stats_lac, stats_lal, stats_phx,
        stats_sac, stats_atl, stats_cha, stats_mia, stats_orl, stats_was,
        stats_dal, stats_hou, stats_mem, stats_nop, stats_sas, stats_o1,
        stats_o2, stats_d1, stats_d2, stats_l25]

def process_cuts(c):
    c = c.dropna(axis=0, how='all')
    n = c.team.str.rsplit(" ", n=1, expand=True)
    c['city'] = n[0]
    c['team_name'] = n[1]
    c['team_name'] = c['team_name'].str.replace('Trailblazers', 'Blazers')
    c['team_name'] = c['team_name'].str.replace('Bobcats', 'Hornets')
    for z in ['Points', 'Steals', 'Blocks', 'Assists', 'OReb', 'DefReb',
              'Turnovers', 'FieldGoals', 'ThreeShots', 'FTP', 'Fouls',
              'FTMiss', 'FGMiss', 'FreeThrows']:
        y = mergered[z]
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
        regressor = LinearRegression()
        regressor.fit(X_train, y_train)
        coeff_df = pd.DataFrame(regressor.coef_, X.columns, columns=['Coefficient'])
        print(coeff_df)
        y_pred = regressor.predict(X_test)
        df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
        # this gets overwritten on every pass through the loop
        rsquared = 'Rsquared:' + ' ' + z, metrics.r2_score(y_test, y_pred)

cuts_diffs = list(map(process_cuts, cuts))
I want to store the R² for each y and print them out for each data cut.
I'd appreciate your help.
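A minimal sketch of one way to do this, with toy data standing in for mergered and the cuts (the helper fit_targets and the toy frame are illustrative, not from the original): collect one R² per target in a dict and return it, so the mapped result holds one dict of scores per data cut.

# Sketch: store one r-squared per target column instead of overwriting
# `rsquared` each pass. Toy data stands in for the question's frames.
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=['a', 'b', 'c'])
targets = pd.DataFrame({'Points': X['a'] * 2 + rng.normal(size=200),
                        'Steals': X['b'] - X['c'] + rng.normal(size=200)})

def fit_targets(X, targets):
    rsquareds = {}
    for z in targets.columns:
        X_train, X_test, y_train, y_test = train_test_split(
            X, targets[z], test_size=0.2, random_state=0)
        y_pred = LinearRegression().fit(X_train, y_train).predict(X_test)
        rsquareds[z] = metrics.r2_score(y_test, y_pred)
    return rsquareds

scores = fit_targets(X, targets)
for name, r2 in scores.items():
    print(f'Rsquared {name}: {r2:.3f}')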
I need to create a for loop in Python that will repeat steps 1-2 below 1,000 times.
1. Split the sample randomly into training/test sets using a 632:368 ratio.
2. Build the model on the 63.2% training data and compute the R² on the holdout data.
I can't seem to grab the R² for the holdout data:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

y = data['Amount']
xall = data
xall.drop(["No", "Amount", "Class"], axis=1, inplace=True)

for seed in range(1_000):
    X_train, X_test, y_train, y_test = train_test_split(xall, y,
                                                        test_size=0.382,
                                                        random_state=seed)
    # note: the split above is never used below
    modelall = LinearRegression().fit(xall, y)
    r_sq = modelall.score(xall, y)
    print('coefficient of determination:', r_sq)
Fit the model using the TRAINING data and estimate the score using the TEST data.
Use this:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

y = data['Amount']
xall = data
xall.drop(["No", "Amount", "Class"], axis=1, inplace=True)

for seed in range(100):
    X_train, X_test, y_train, y_test = train_test_split(xall, y, test_size=0.382, random_state=seed)
    modelall = LinearRegression()
    modelall.fit(X_train, y_train)         # fit on the training split
    r_sq = modelall.score(X_test, y_test)  # score on the holdout split
    print('coefficient of determination:', r_sq)
You are fitting a linear model to the whole dataset (xall) with a different seed number. Linear regression should give you the same output irrespective of the seed value.
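A quick check of that point (a sketch with synthetic data, not from the original thread): LinearRegression has no random component, so refitting on the same data gives identical coefficients no matter what seed is in play.

# Sketch: LinearRegression is deterministic; only the split uses the seed.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)

coefs = [LinearRegression().fit(X, y).coef_ for _ in range(3)]
print(np.allclose(coefs[0], coefs[1]) and np.allclose(coefs[1], coefs[2]))
# True: same data in, same model out, every time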
Has anybody managed to reach the same results when implementing ElasticNetCV in Python and cv.glmnet in R?
I have found out how to make it work with ElasticNet in Python and glmnet in R, but I cannot reproduce it with the cross-validation methods...
Steps to reproduce in Python:
Preprocessing:
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, ElasticNetCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import pandas as pd
data = make_regression(
    n_samples=100000,
    random_state=0
)
X, y = data[0], data[1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25)
pd.DataFrame(X_train).to_csv('X_train.csv', index=None)
pd.DataFrame(X_test).to_csv('X_test.csv', index=None)
pd.DataFrame(y_train).to_csv('y_train.csv', index=None)
pd.DataFrame(y_test).to_csv('y_test.csv', index=None)
Models:
model = ElasticNet(
    alpha=1.0,
    l1_ratio=0.5,
    fit_intercept=True,
    normalize=True,
    precompute=False,
    max_iter=100000,
    copy_X=True,
    tol=0.0000001,
    warm_start=False,
    positive=False,
    random_state=0,
    selection='cyclic'
)
model.fit(X=X_train, y=y_train)
y_pred = model.predict(X=X_test)
print(mean_squared_error(y_true=y_test, y_pred=y_pred))
output: 42399.49815189786
model = ElasticNetCV(
    l1_ratio=0.5,
    eps=0.001,
    n_alphas=100,
    alphas=None,
    fit_intercept=True,
    normalize=True,
    precompute=False,
    max_iter=100000,
    tol=0.0000001,
    cv=10,
    copy_X=True,
    verbose=0,
    n_jobs=-1,
    positive=False,
    random_state=0,
    selection='cyclic'
)
model.fit(X=X_train, y=y_train)
y_pred = model.predict(X=X_test)
print(mean_squared_error(y_true=y_test, y_pred=y_pred))
output: 39354.729173913176
Steps to reproduce in R:
Preprocessing:
library(glmnet)

# read back the CSVs written from Python (paths elided in the original)
X_train <- read.csv(path)
X_test <- read.csv(path)
y_train <- read.csv(path)
y_test <- read.csv(path)

fit <- glmnet(x = as.matrix(X_train), y = as.matrix(y_train))
y_pred <- predict(fit, newx = as.matrix(X_test))
y_error <- y_test - y_pred
mean(as.matrix(y_error)^2)
output: 42399.5
fit <- cv.glmnet(x=as.matrix(X_train), y=as.matrix(y_train))
y_pred <- predict(fit, newx = as.matrix(X_test))
y_error <- y_test - y_pred
mean(as.matrix(y_error)^2)
output: 37.00207
Thanks so much for providing the example. I am on a laptop, so I had to reduce the number of samples to 100:
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, ElasticNetCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import pandas as pd
data = make_regression(
    n_samples=100,
    random_state=0
)
X, y = data[0], data[1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25)
When you call predict with glmnet, you need to specify lambda; otherwise it returns the predictions for all lambdas. So in R:
fit <- glmnet(x=as.matrix(X_train), y=as.matrix(y_train))
y_pred <- predict(fit, newx = as.matrix(X_test))
dim(y_pred)
[1] 25 89
When you run cv.glmnet, it selects the best lambda from the cross-validation (lambda.1se), so it gives you only one set of predictions, which yields the MSE you wanted:
fit <- cv.glmnet(x=as.matrix(X_train), y=as.matrix(y_train))
y_pred <- predict(fit, newx = as.matrix(X_test))
y_error <- y_test - y_pred
mean(as.matrix(y_error)^2)
[1] 22.03504
dim(y_error)
[1] 25 1
fit$lambda.1se
[1] 1.278699
If we select the lambda closest to the one chosen by cv.glmnet, glmnet gives back something in the correct range:
fit <- glmnet(x = as.matrix(X_train), y = as.matrix(y_train))
# which.min needs abs() here, otherwise it just picks the smallest lambda
sel <- which.min(abs(fit$lambda - 1.278699))
y_pred <- predict(fit, newx = as.matrix(X_test))[, sel]
mean(as.matrix((y_test - y_pred)^2))
[1] 20.0775
Before we compare with sklearn, we need to make sure we are testing over the same range of lambdas.
L <- c(0.01, 0.05, 0.1, 0.2, 0.5, 1, 2)
fit <- cv.glmnet(x = as.matrix(X_train), y = as.matrix(y_train), lambda = L)
y_pred <- predict(fit, newx = as.matrix(X_test))
y_error <- y_test - y_pred
mean(as.matrix(y_error)^2)
[1] 0.003065869
So we expect something in the range of 0.003065869. Now we run sklearn over the same lambdas. glmnet's lambda is what ElasticNet calls alpha, while glmnet's alpha is in fact your l1_ratio (see the glmnet vignette). And the normalize option should be set to False, because:
If True, the regressors X will be normalized before regression by
subtracting the mean and dividing by the l2-norm. If you wish to
standardize, please use sklearn.preprocessing.StandardScaler before
calling fit on an estimator with normalize=False.
So we just run it using CV:
model = ElasticNetCV(l1_ratio=1, fit_intercept=True,
                     alphas=[0.01, 0.05, 0.1, 0.2, 0.5, 1, 2])
model.fit(X=X_train, y=y_train)
y_pred = model.predict(X=X_test)
mean_squared_error(y_true=y_test, y_pred=y_pred)
0.0018007824874741929
It's in the same ballpark as the R result.
And if you do it with plain ElasticNet, you will get the same result, provided you specify the alpha.
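A sketch of that last claim, reusing the fitted ElasticNetCV from above as model and the same train/test split (the variable names are carried over; the comparison itself is an assumption to verify): refit a plain ElasticNet at the CV-selected alpha_ and the test MSE should come out essentially the same.

# Sketch: a plain ElasticNet at the CV-selected alpha reproduces the CV model.
# Assumes `model` is the ElasticNetCV fitted above and X_train/X_test/y_train/y_test
# come from the split earlier in the thread.
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error

single = ElasticNet(alpha=model.alpha_, l1_ratio=1, fit_intercept=True)
single.fit(X_train, y_train)
print(mean_squared_error(y_test, single.predict(X_test)))
# expected: essentially the same MSE as the ElasticNetCV result above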
I'm using scikit-learn to train some classifiers. I do cross-validation and then compute the AUC. However, I get a different AUC number every time I run the tests, even though I made sure to use a seed and a RandomState. I want my tests to be deterministic. Here's my code:
import numpy as np
from sklearn import cross_validation, linear_model
from sklearn.metrics import auc, roc_curve
from sklearn.utils import shuffle

SEED = 0
random_state = np.random.RandomState(SEED)

X, y = shuffle(data, Y, random_state=random_state)
X_train, X_test, y_train, y_test = \
    cross_validation.train_test_split(X, y, test_size=test_size, random_state=random_state)

clf = linear_model.LogisticRegression()
kfold = cross_validation.KFold(len(X), n_folds=n_folds)

mean_tpr = 0.0
mean_fpr = np.linspace(0, 1, 100)
for train, test in kfold:
    probas_ = clf.fit(X[train], y[train]).predict_proba(X[test])  # use the shuffled y, not Y
    fpr, tpr, thresholds = roc_curve(y[test], probas_[:, 1])
    mean_tpr += np.interp(mean_fpr, fpr, tpr)
    mean_tpr[0] = 0.0
mean_tpr /= len(kfold)
mean_tpr[-1] = 1.0
mean_auc = auc(mean_fpr, mean_tpr)
My questions:
1. Is there something wrong in my code that's making the results different each time I run it?
2. Is there a global way to make scikit-learn deterministic?
EDIT:
I just tried this:
test_size = 0.5
X = np.random.randint(10, size=(10, 2))
Y = np.random.randint(2, size=(10))
SEED = 0
random_state = np.random.RandomState(SEED)
X_train, X_test, y_train, y_test = \
    cross_validation.train_test_split(X, Y, test_size=test_size, random_state=random_state)
print X_train  # I recorded the result
Then I did:
# notice the change in random_state
X_train, X_test, y_train, y_test = \
    cross_validation.train_test_split(X, Y, test_size=test_size, random_state=6)
Then I did:
X_train, X_test, y_train, y_test = \
    cross_validation.train_test_split(X, Y, test_size=test_size, random_state=random_state)
print X_train  # the result is different from the first one!
As you can see, I'm getting different results even though I used the same random_state. How can I solve this?
LogisticRegression uses randomness internally and has an (undocumented, will fix in a moment) random_state argument.
There's no global way of setting the random state, because unfortunately the random state on LogisticRegression and the SVM code can only be set in a hacky way. That's because this code comes from Liblinear and LibSVM, which use the C standard library's rand function and that cannot be seeded in a principled way.
EDIT: The above is true, but it is probably not the cause of the problem here. You're threading a single np.random.RandomState instance through your calls; each call draws from it and advances its internal state, so two otherwise identical calls see different states. Pass the same integer seed instead for easy reproducibility.
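A small demonstration of the difference (a sketch using the modern sklearn.model_selection API rather than the old cross_validation module): a RandomState instance is consumed and advances on every call, while an integer seed re-creates the same state each time.

# Sketch: an int seed is reproducible across calls; a RandomState instance
# mutates, so two identical calls see different states.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

rs = np.random.RandomState(0)
a, _, _, _ = train_test_split(X, y, test_size=0.5, random_state=rs)
b, _, _, _ = train_test_split(X, y, test_size=0.5, random_state=rs)
print((a == b).all())   # usually False: the instance advanced between calls

c, _, _, _ = train_test_split(X, y, test_size=0.5, random_state=0)
d, _, _, _ = train_test_split(X, y, test_size=0.5, random_state=0)
print((c == d).all())   # True: the int re-seeds identically each call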