How to split data into train and test sets? Is cross-validation possible? M-estimation or OLS? - python

I have 26 observations and want to fit a simple linear regression, but when I split the data into 70% for training and 30% for testing, the results on the test data (R squared / p-value) are usually not good. Is it because the test sample is too small? Are 8 or 9 observations not enough? What should I do? I did not set a random state, so the algorithm chooses the split randomly.
I am also wondering how to choose between OLS and M-estimation for my dataset. M-estimation is more resistant to outliers, which my data has (see below), because Variable B is affected by variables other than A.
This is the code I have written so far. I would like to do cross-validation on the training data.
Is that possible with the number of observations I have?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats
data = pd.read_excel("C:\\Users\\AchourAh\\Desktop\\PL32_PMM_03_09_2018_SP_Level.xlsx",'Sheet1')
data1 = data.fillna(0) #Replace null values of the whole dataset with 0
X = data1.iloc[0:len(data1),1].values.reshape(-1, 1)
Y = data1.iloc[0:len(data1),2].values.reshape(-1, 1)
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size =0.33)
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, Y_train)
plt.scatter(X_train, Y_train, color = 'red')
plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.xlabel('COP COR Quantity')
plt.ylabel('PAUS Quantity')
plt.scatter(X_test, Y_test, color = 'red')
plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.xlabel('COP COR Quantity')
plt.ylabel('PAUS Quantity')
X2 = sm.add_constant(X_train)
est = sm.OLS(Y_train, X2)
est2 = est.fit()
X3 = sm.add_constant(X_test)
est3 = sm.OLS(Y_test, X3)
est4 = est3.fit()
This is an example of the data I have. My goal is not to build a good predictive model but to describe the impact of variable A on B. Also, when I analyze the whole dataset together, the results are always better than when I split the data.
Variable A Variable B
87.000 573.000
90.000 99.000
258.000 339.000
180.000 618.000
0 69.000
90.000 621.000
90.000 231.000
210.000 345.000
255.000 255.000
0 0
213.000 372.000
405.000 405.000
162.000 162.000
405.000 405.000
0 186.000
105.000 252.000
474.000 501.000
531.000 531.000
549.000 549.000
525.000 525.000
360.000 660.000
546.000 546.000
645.000 645.000
561.000 600.000
978.000 1.104.000
960.000 960.000
Also, I plotted the results using scikit-learn and analyzed them with statsmodels. Can I assume that the plotted line corresponds to the values reported by statsmodels, or is there something to change in the code?

Y=df["Column name"]
X=df[[ "All other Columns"]]
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state = 465)
Good luck
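With only 26 observations, a single 70/30 split leaves 8 or 9 test points, so test metrics can vary a lot from one random split to the next. A minimal sketch of leave-one-out cross-validation follows, comparing OLS with a Huber M-estimator. The data is synthetic (the slope, noise level and outliers are assumptions for illustration, not the values from the question), and scikit-learn's HuberRegressor is swapped in for statsmodels' M-estimation. Mean absolute error is used because R squared is undefined on a single held-out point.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Synthetic stand-in for the 26 observations (assumed, not the real data).
rng = np.random.default_rng(0)
A = rng.uniform(0, 10, size=26)
B = A + rng.normal(0, 0.3, size=26)
B[:4] += 4.0  # a few outliers where B is driven by something other than A

X = A.reshape(-1, 1)
loo = LeaveOneOut()  # 26 folds, each holding out one observation

results = {}
for name, model in [("OLS", LinearRegression()),
                    ("Huber", HuberRegressor(max_iter=1000))]:
    # scikit-learn returns negated errors, so flip the sign back
    scores = -cross_val_score(model, X, B, cv=loo,
                              scoring="neg_mean_absolute_error")
    results[name] = scores
    print(f"{name}: LOOCV mean absolute error = {scores.mean():.2f}")
```

Since the stated goal is to describe the effect of A on B rather than to predict, fitting on all 26 observations and reporting the slope with its confidence interval may also be reasonable; cross-validation mainly shows how stable the fit is.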


XGBoost can't predict a simple sinusoidal function

I created a very simple function to test XGBoost.
X is an array of 1000 evenly spaced values from 0 to "7*np.pi".
Y is simply "1 + 0.5*np.sin(x)"
I split the dataset into 800 training and 200 testing rows. Shuffle MUST be False to simulate future occurrences, making sure the last 200 rows are reserved for testing.
import numpy as np
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
from sklearn.metrics import mean_squared_error as MSE
from xgboost import XGBRegressor
N = 1000 # 1000 rows
x = np.linspace(0, 7*np.pi, N) # Simple function
y = 1 + 0.5*np.sin(x) # Generate simple function sin(x) as y
# Train-test split, intentionally use shuffle=False to simulate time series
X = x.reshape(-1,1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, shuffle=False)
### Interestingly, model generalizes well if shuffle=True
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, shuffle=True)
XGB_reg = XGBRegressor(random_state=42)
XGB_reg.fit(X_train,y_train)
yXGBPredicted = XGB_reg.predict(X_train)
rmse = np.sqrt(MSE(y_train, yXGBPredicted))
print("RMSE TRAIN XGB: % f" %(rmse))
yXGBPredicted = XGB_reg.predict(X_test)
rmse = np.sqrt(MSE(y_test, yXGBPredicted))
print("RMSE TEST XGB: % f" %(rmse))
# Predict full dataset
yXGB = XGB_reg.predict(X)
# Plot and compare
plt.style.use('fivethirtyeight')
plt.rcParams.update({'font.size': 16})
fig, ax = plt.subplots(figsize=(10,5))
plt.plot(x, y)
plt.plot(x, yXGB)
I trained the model on the first 800 rows and then predicted the next 200 rows.
I was expecting the testing data to have a low RMSE, but it did not happen.
I was surprised to see that XGBoost simply repeated the last value of the training set on all rows of the predictions (see chart).
Any ideas why this doesn't work?
You're asking your model to "extrapolate" - making predictions for x values that are greater than x values in the training dataset. Extrapolation works with some model types (such as linear models), but it typically does not work with decision tree models and their ensembles (such as XGBoost).
If you switch from XGBoost to LightGBM, then you can train extrapolation-capable decision tree ensembles using the "linear tree" approach:
Any ideas why this doesn't work?
Your XGBRegressor is probably over-fitted (has n_estimators = 100 and max_depth = 6). If you decrease those parameter values, then the red line will appear more jagged, and it will be easier for you to see it "working".
Right now, if you ask your over-fitted XGBRegressor to extrapolate, then it basically functions as a giant look-up table. When extrapolating towards +Inf, then the "closest match" is at x = 17.5; when extrapolating towards -Inf, then the "closest match" is at x = 0.0.
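The behaviour described in the answers above can be reproduced without XGBoost. The sketch below uses scikit-learn's GradientBoostingRegressor as a stand-in tree ensemble (an assumption for illustration; it is not the asker's model) and shows that every prediction beyond the training range is the same constant.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

N = 1000
x = np.linspace(0, 7 * np.pi, N)
y = 1 + 0.5 * np.sin(x)
X = x.reshape(-1, 1)

# Chronological split: first 800 rows for training, last 200 for testing
X_train, X_test = X[:800], X[800:]
y_train, y_test = y[:800], y[800:]

model = GradientBoostingRegressor(random_state=42)
model.fit(X_train, y_train)

test_pred = model.predict(X_test)
last_train_pred = model.predict(X_train[-1:])[0]

# All test inputs exceed every split threshold learned from the training
# data, so they fall into the same leaf of every tree.
print("distinct test predictions:", np.unique(test_pred).size)
print("prediction at last training point:", last_train_pred)
```

A model with an explicit functional form (for example, a regression on sin and cos features) can continue the curve, whereas a tree ensemble can only return values it has stored in its leaves.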

Multiple linear regression, training dataset graphs, ValueError: x and y must be the same size

I am running the following code; the graph for the training dataset gives an error.
import pandas as pd
import numpy as np
df = pd.read_csv('11.csv')
      AT      V       AP     RH      PE
0   8.34  40.77  1010.84  90.01  480.48
1  23.64  58.49  1011.40  74.20  445.75
2  29.74  56.90  1007.15  41.91  438.76
3  19.07  49.69  1007.22  76.79  453.09
4  11.80  40.66  1017.13  97.20  464.43
x = df.drop(['PE'], axis = 1).values
y = df['PE'].values
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size = 0.2, random_state=0)
from sklearn.linear_model import LinearRegression
ml = LinearRegression()
ml.fit(x_train, y_train)
y_pred = ml.predict(x_test)
import matplotlib.pyplot as plt
plt.scatter(x_train, y_train, color = 'red')
plt.plot(x_train, ml.predict(x_test), color = 'green') ***
Please help me reshape the 2D array to 1D for plotting graphs.
**ValueError: x and y must be the same size**
EDIT: Now that your question has its format fixed, I'm spotting a few errors, with a common theme of using 1D linear regression code to plot a multiple regression problem.
plt.scatter(x_train, y_train, color = 'red'): You're trying to plot multiple variables (AT, V, AP, RH) on one axis using x_train. You cannot do this with multiple linear regression. (For example, one can't fit pressure and volume on the x-axis against temperature on the y-axis. What would the x-axis represent?) I can't suggest a specific fix since I don't know what you want to plot. You can try one variable at a time, e.g. plt.scatter(x_train[:, 0], y_train, color='red') for the first feature column (x_train is a NumPy array here, so columns are selected by position rather than by name). Or you can plot each variable in a different color on the same graph, though I don't recommend this since the variables may have different units.
plt.plot(x_train, ml.predict(x_test)): The x input should be y_test, e.g. plt.plot(y_test, ml.predict(x_test)). This error comes from the length (number of rows) of your data, not its width (number of columns) as in the error above: x_train has one row per training sample, while ml.predict(x_test) has one value per test sample. If my suggestion isn't what you wanted (plotting y_test against the predictions may look strange at first), you may be carrying over assumptions or code from 1D linear regression while working with multiple linear regression, which is a possible theme in these errors.
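The shape mismatch discussed above can be checked directly before plotting. The following is a minimal sketch on synthetic data; the four features merely stand in for AT, V, AP and RH, and the coefficients are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data with 4 features (assumed stand-in for the real CSV)
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 4))
y = x @ np.array([-2.0, -0.3, 0.1, -0.2]) + 450 + rng.normal(scale=0.5, size=100)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0)
ml = LinearRegression().fit(x_train, y_train)
y_pred = ml.predict(x_test)

print(x_train.shape)  # (80, 4): 80 rows, 4 columns
print(y_pred.shape)   # (20,): one prediction per test row

# Pairs with matching lengths that can be plotted against each other
first_feature = x_train[:, 0]  # one column of the training features
print(first_feature.shape == y_train.shape)  # True
print(y_test.shape == y_pred.shape)          # True
```

With matplotlib, plt.scatter(x_train[:, 0], y_train) plots one feature against the target, and plt.scatter(y_test, y_pred) gives a predicted-versus-actual plot that works for any number of features.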

SVR/SVM output predictions are very similar to each other but far from true value

The main idea is to predict 2 target outputs based on input features.
The input features are already scaled using StandardScaler() from sklearn.
The size of X_train is (190 x 6) and Y_train is (190 x 2). X_test is (20 x 6) and Y_test is (20 x 2).
For both the linear and rbf kernels I use GridSearchCV to find the best C (linear), or gamma and C (rbf).
[PROBLEM] I perform SVR using MultiOutputRegressor with both linear and rbf kernels, but the predicted outputs are very similar to each other (not exactly a constant prediction) and pretty far from the true values of y.
Below are the plots, where the scatter points represent the true values of Y. The first picture corresponds to the first target, Y[:,0], and the second picture to the second target, Y[:,1].
Do I have to scale my target output? Is there any other model that could help improve test accuracy?
I have also tried a random forest regressor with tuning, and the test accuracy is about the same as what I get with SVR (results from SVR below).
Best parameter: {'estimator__C': 1}
MAE: [18.51151192 9.604601 ] #from linear kernel
Best parameter (rbf): {'estimator__C': 1, 'estimator__gamma': 1e-09}
MAE (rbf): [17.80482033 9.39780134] #from rbf kernel
Thank you so much! Any help and input is greatly appreciated!! ^__^
---------------- Code -----------------------------
import numpy as np
from numpy import load
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.multioutput import MultiOutputRegressor
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import RepeatedKFold
rkf = RepeatedKFold(n_splits=5, n_repeats=3)
#input features - HR, HRV, PTT, breathing_rate, LASI, AI
X = load('200_patient_input_scaled.npy')
#Output features - SBP, DBP
Y = load('200_patient_output_raw.npy')
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.095, random_state = 43)
epsilon = 0.1
#--------------------------- Linear SVR kernel Model ------------------------------------------------------
linear_svr = SVR(kernel='linear', epsilon = epsilon)
multi_output_linear_svr = MultiOutputRegressor(linear_svr)
multi_output_linear_svr.fit(X_train, Y_train) #just to see the output
#GridSearch - find the best C
grid = {'estimator__C': [1,10,10,100,1000] }
grid_linear_svr = GridSearchCV(multi_output_linear_svr, grid, scoring='neg_mean_absolute_error', cv=rkf, refit=True)
grid_linear_svr.fit(X_train, Y_train)
Y_predict = grid_linear_svr.predict(X_test)
print("\nBest parameter:", grid_linear_svr.best_params_ )
print("MAE:", mean_absolute_error(Y_predict,Y_test, multioutput='raw_values'))
#-------------------------- RBF SVR kernel Model --------------------------------------------------------
rbf_svr = SVR(kernel='rbf', epsilon = epsilon)
multi_output_rbf_svr = MultiOutputRegressor(rbf_svr)
#Grid search - Find best combination of C and gamma
grid_rbf = {'estimator__C': [1,10,10,100,1000], 'estimator__gamma': [1e-9, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2] }
grid_rbf_svr = GridSearchCV(multi_output_rbf_svr, grid_rbf, scoring='neg_mean_absolute_error', cv=rkf, refit=True)
grid_rbf_svr.fit(X_train, Y_train)
Y_predict_rbf = grid_rbf_svr.predict(X_test)
print("\nBest parameter (rbf):", grid_rbf_svr.best_params_ )
print("MAE (rbf):", mean_absolute_error(Y_predict_rbf,Y_test, multioutput='raw_values'))
plot_y_predict = Y_predict_rbf[:,1]
plt.scatter( np.linspace(0, 20, num = 20), Y_test[:,1], color = 'red')
plt.plot(np.linspace(0, 20, num = 20), plot_y_predict)
A common mistake when people use StandardScaler is applying it along the wrong axis of the data. StandardScaler standardizes each column (feature) separately, so it expects rows to be samples and columns to be features; if the array is transposed, or everything is scaled with one global mean and standard deviation, the result is wrong. Please make sure you've done this right! I would check it by hand to be sure.
[RESPONSE/EDIT]: I think that just negates what StandardScaler did by inverting the transformation. I'm saying all this from experience of having had trouble scaling multi-feature data. If I were you (for example, for MinMax scaling), I would prefer something like this:
columnsX = X.shape[1]
for i in range(columnsX):
    X[:, i] = (X[:, i] - X[:, i].min()) / (X[:, i].max() - X[:, i].min())
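The per-column loop above can also be written without a loop, and the same idea gives a quick way to confirm what StandardScaler does. This is a minimal sketch on a small synthetic matrix (the values are made up for illustration), assuming rows are samples and columns are features.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrix: 5 samples (rows) x 3 features (columns)
X = np.array([[1.0, 200.0, 0.5],
              [2.0, 180.0, 0.1],
              [3.0, 220.0, 0.9],
              [4.0, 210.0, 0.3],
              [5.0, 190.0, 0.7]])

# Min-max scaling of every column at once (axis=0 reduces over the rows)
col_min = X.min(axis=0)
col_max = X.max(axis=0)
X_minmax = (X - col_min) / (col_max - col_min)

# Standardization by hand, column by column
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# StandardScaler gives the same result, i.e. it already works per column
X_sklearn = StandardScaler().fit_transform(X)
print(np.allclose(X_manual, X_sklearn))  # True
```

If the check prints True on your own data layout, the scaler is being applied along the intended axis.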

kNN algorithm on apple stock

I'm trying to create a kNN algorithm for stock prediction, with at least 80% correct predictions on the test data. I have a problem with the StandardScaler from sklearn. For some reason it says that there is a "typo" in the word "Scaler", which I find weird. Does someone know how to solve this issue? If you find more mistakes in the code, please tell me how to fix them; I think it should be mostly correct, but some parts might be wrong. I want the polynomial line to extend about a week into the future. I use data obtained with a private API key, which is provided in JSON format. The data consists of EOD (end of day) data, limited to 1000 days, in descending order.
# Exports API data to a csv file on my hardware and then I import the csv data after it's sorted
df.to_csv('Test_Sample.csv', index=False)
dataframe = pd.read_csv('Test_Sample.csv')
X = df.iloc[:, :-1].values
Y = df.iloc[:, 4].values
# 80% training data, 20% testing data
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
# Scale train and test data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler() #Here is the mistake, under scaler (Error code: 'Typo in the word scaler')
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
# Classify data
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, Y_train)
Y_pred = classifier.predict(X_test)
# Train and test result
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(Y_test, Y_pred))
print(confusion_matrix(Y_test, Y_pred))
# Scatter all the data points in a figure
import matplotlib.pyplot as plt
plt.scatter(X, Y, color='blue')
plt.title('Financial Instrument Predicted Price')
# Fitting Polynomial Regression to the dataset
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=4)
X_poly = poly.fit_transform(X)
poly.fit(X_poly, Y)
plt.plot(X, poly.fit_transform(X), color='red')
ValueError: could not convert string to float: 'AAPL'
You don't have a typo, in the comments you said:
ValueError: could not convert string to float: 'AAPL'
The error is actually clear: you have a string in your dataset, and you are trying to normalize/standardize the data. Most algorithms need strings to be encoded as numbers. Since you did not provide a data sample, check before splitting whether your dataframe contains strings.
Edit: Check if your first row is supposed to be your header, then you can do the following:
dataframe = pd.read_csv('Test_Sample.csv', header = 0)
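One way to check for string columns before scaling is to inspect the column dtypes. The sketch below uses a hypothetical three-row sample; the column names (symbol, open, high, low, close) are assumptions, since the real CSV layout was not shown.

```python
import pandas as pd

# Hypothetical EOD sample; the column names are assumptions
df = pd.DataFrame({
    "symbol": ["AAPL", "AAPL", "AAPL"],
    "open": [150.1, 151.3, 149.8],
    "high": [152.0, 153.1, 151.2],
    "low": [149.5, 150.2, 148.9],
    "close": [151.2, 152.4, 150.3],
})

print(df.dtypes)  # the string column does not have a numeric dtype

# Keep only numeric columns before scaling
numeric_df = df.select_dtypes(include="number")
non_numeric = [c for c in df.columns if c not in numeric_df.columns]
print("non-numeric columns:", non_numeric)
```

Any column listed as non-numeric either needs to be dropped (a constant ticker symbol carries no information) or encoded before it is passed to StandardScaler.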

Different accuracy for python (Scikit-Learn) and R (e1071)

For the same dataset (here Bupa) and parameters I get different accuracies.
What did I overlook?
R implementation:
data_file = ""
dataset = read.csv(data_file, header = FALSE)
nobs <- nrow(dataset) # 303 observations
sample <- train <- sample(nrow(dataset), 0.95*nobs) # 227 observations
# validate <- sample(setdiff(seq_len(nrow(dataset)), train), 0.1*nobs) # 30 observations
test <- setdiff(seq_len(nrow(dataset)), train) # 76 observations
svmfit <- svm(V7~ .,data=dataset[train,],
testpr <- predict(svmfit, newdata=na.omit(dataset[test,]))
accuracy <- sum(testpr==na.omit(dataset[test,])$V7)/length(na.omit(dataset[test,])$V7)
I get accuracy: 0.94
but when I do the following in Python (scikit-learn)
import numpy as np
from sklearn import cross_validation
from sklearn import datasets
import pandas as pd
from sklearn import svm, grid_search
f = open("data/")
dataset = np.loadtxt(fname = f, delimiter = ',')
nobs = np.shape(dataset)[0]
print("Number of Observations: %d" % nobs)
y = dataset[:,6]
X = dataset[:,:-1]
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.06, random_state=0)
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
scores = cross_validation.cross_val_score(clf, X, y, cv=10, scoring='accuracy')
I get accuracy 0.67
please help me.
I came across this post having the same issue: wildly different accuracy between the scikit-learn and e1071 bindings for libSVM. I think the issue is that e1071 scales the training data and then keeps the scaling parameters for use in predicting new observations. Scikit-learn does not do this and leaves it up to the user to realize that the same scaling must be applied to both training and test data. I only thought to check this after reading this guide from the nice people behind libSVM.
While I don't have your data, str(svmfit) should give you the scaling parameters (mean and standard deviation of the columns of Bupa). You can use these to scale your data appropriately in Python (see below for an idea). Alternatively, you can scale the entire dataset together in Python and then do the train/test split; either way should now give you identical predictions.
def manual_scale(a, means, sds):
    a1 = a - means
    a1 = a1/sds
    return a1
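A minimal sketch of how such a helper might be used follows. In practice the means and standard deviations would be read from the fitted R model; here they are computed from synthetic training data, which is an assumption for illustration, as are the array sizes.

```python
import numpy as np

def manual_scale(a, means, sds):
    """Center and scale the columns of `a` with precomputed statistics."""
    return (a - means) / sds

# Synthetic data standing in for the training and test features
rng = np.random.default_rng(0)
X_train = rng.normal(loc=50, scale=10, size=(200, 6))
X_test = rng.normal(loc=50, scale=10, size=(20, 6))

# Statistics come from the training set only; ddof=1 matches R's sd()
means = X_train.mean(axis=0)
sds = X_train.std(axis=0, ddof=1)

X_train_s = manual_scale(X_train, means, sds)
X_test_s = manual_scale(X_test, means, sds)  # reuse the training statistics
```

The key point is that X_test is scaled with the training means and standard deviations, not with its own.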
When using Support Vector Regression in Python/sklearn and R/e1071 both x and y variables need to be scaled/unscaled.
Here is a self-contained example using rpy2 to show equivalence of R and Python results (first part with disabled scaling in R, second part with 'manual' scaling in Python):
# import modules
import matplotlib.pyplot as plt
import numpy as np
import sklearn
import sklearn.model_selection
import sklearn.datasets
import sklearn.preprocessing
import sklearn.svm
import rpy2
import rpy2.robjects
import rpy2.robjects.packages
# use R e1071 SVM function via rpy2
def RSVR(x_train, y_train, x_test,
         cost=1.0, epsilon=0.1, gamma=0.01, scale=False):
    # convert Python arrays to R matrices
    rx_train = rpy2.robjects.r['matrix'](rpy2.robjects.FloatVector(np.array(x_train).T.flatten()), nrow = len(x_train))
    ry_train = rpy2.robjects.FloatVector(np.array(y_train).flatten())
    rx_test = rpy2.robjects.r['matrix'](rpy2.robjects.FloatVector(np.array(x_test).T.flatten()), nrow = len(x_test))
    # train SVM
    e1071 = rpy2.robjects.packages.importr('e1071')
    rsvr = e1071.svm(x=rx_train,
    # run SVM
    predict = rpy2.robjects.r['predict']
    ry_pred = np.array(predict(rsvr, rx_test))
    return ry_pred
# define auxiliary function for plotting results
def plot_results(y_test, py_pred, ry_pred, title, lim=[-500, 500]):
    plt.plot(lim, lim, lw=2, color='gray', zorder=-1)
    plt.scatter(y_test, py_pred, color='black', s=40, label='Python/sklearn')
    plt.scatter(y_test, ry_pred, color='orange', s=10, label='R/e1071')
    return None
# get example regression data
x_orig, y_orig = sklearn.datasets.make_regression(n_samples=100, n_features=10, random_state=42)
# split into train and test set
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x_orig, y_orig, train_size=0.8)
# SVM parameters
# (identical but named differently for R/e1071 and Python/sklearn)
C = 1000.0
epsilon = 0.1
gamma = 0.01
# setup SVM and scaling classes
psvr = sklearn.svm.SVR(kernel='rbf', C=C, epsilon=epsilon, gamma=gamma)
x_sca = sklearn.preprocessing.StandardScaler()
y_sca = sklearn.preprocessing.StandardScaler()
# run R and Python SVMs without any scaling
# (see 'scale=False')
py_pred = psvr.fit(x_train, y_train).predict(x_test)
ry_pred = RSVR(x_train, y_train, x_test,
cost=C, epsilon=epsilon, gamma=gamma, scale=False)
# scale both x and y variables
sx_train = x_sca.fit_transform(x_train)
sy_train = y_sca.fit_transform(y_train.reshape(-1, 1))[:, 0]
sx_test = x_sca.transform(x_test)
sy_test = y_sca.transform(y_test.reshape(-1, 1))[:, 0]
# run Python SVM on scaled data and invert scaling afterwards
ps_pred = psvr.fit(sx_train, sy_train).predict(sx_test)
ps_pred = y_sca.inverse_transform(ps_pred.reshape(-1, 1))[:, 0]
# run R SVM with native scaling on original/unscaled data
# (see 'scale=True')
rs_pred = RSVR(x_train, y_train, x_test,
cost=C, epsilon=epsilon, gamma=gamma, scale=True)
# plot results
plot_results(y_test, py_pred, ry_pred, 'without scaling (Python/sklearn default)')
plot_results(y_test, ps_pred, rs_pred, 'with scaling (R/e1071 default)')
UPDATE: Actually, the scaling uses a slightly different definition of variance in R and Python, see this answer (1/(N-1)... in R vs. 1/N... in Python where N is the sample size). However, for typical sample sizes, this should be negligible.
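The size of this difference is easy to check numerically. The sketch below uses NumPy's ddof argument to compute both definitions on a synthetic sample of N = 100 (the sample itself is an assumption for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)

sd_population = x.std(ddof=0)  # 1/N, the definition StandardScaler uses
sd_sample = x.std(ddof=1)      # 1/(N-1), the definition R's sd() uses

# The two differ only by the factor sqrt(N / (N - 1))
ratio = sd_sample / sd_population
print(ratio, np.sqrt(100 / 99))  # the two values agree
```

For N = 100 the factor is about 1.005, so scaled values differ by roughly half a percent; the gap shrinks as N grows.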
I can confirm these statements. One indeed needs to apply the same scaling to the train and test sets. In particular I have done this:
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X = sc_X.fit_transform(X)
where X is my training set. Then, when preparing the test set, I simply used the StandardScaler instance fitted on the training set. It is important to use it only for transforming, not for fitting and transforming (as above), i.e.:
X_test = sc_X.transform(X_test)
This gave substantial agreement between the R and scikit-learn results.
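One way to avoid handling the scaler by hand is to put it in a pipeline with the classifier, so that the scaling fitted on the training data is reused automatically at prediction time. This is a minimal sketch on synthetic data generated with make_classification (an assumption; it is not the Bupa dataset), using the linear kernel and C=1 from the question.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic classification data (an assumption; not the Bupa dataset)
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# The pipeline fits the scaler on the training data only and reuses
# the same means and standard deviations when predicting
clf = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1))
clf.fit(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)

# In cross-validation the scaler is refit inside each training fold
cv_scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
print(test_accuracy, cv_scores.mean())
```

Passing the whole pipeline to cross_val_score also keeps test-fold statistics out of the scaling step, which scaling the full dataset up front would not.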
