How to iterate over rows in a dataset for distance calculation - Python

I have the Iris dataset and I want to calculate the distance between all pairs of rows, i.e. 0 and 1, 0 and 2, ..., 1 and 2, 1 and 3, ..., for KNN.
My code:
import numpy as np
from sklearn import datasets
import pandas as pd
#1 Handle the data
iris = datasets.load_iris()
x = iris.data[:, :4]
y = iris.target.reshape((150,1))
def shuffle(x, y, percentage):
    iris_data = np.concatenate((x, y), axis=1)
    shuffling = iris_data[np.random.permutation(len(iris_data))]
    train, test = np.split(shuffling, [int(percentage*len(iris_data))])
    x_train = train[:, :4]
    y_train = train[:, -1]
    x_test = test[:, :4]
    y_test = test[:, -1]
    return [iris_data, x_train, y_train, x_test, y_test]
shuf = shuffle(x,y,0.7)
x_train= shuf[1]; y_train= shuf[2]
x_test= shuf[3]; y_test= shuf[4]
#2 Distance function
def distance(x, x_test, y, y_test):
    cont = 0
    dist = {}
    for i in range(x.shape[0]):
        for j in range(x.shape[0]):
            cont += (x[i] - x_test[j])**2
        dist[i] = (np.sqrt(cont), y[i])
    return dist
But I get a dictionary of NumPy arrays of shape (4,) instead of scalars.
I tried to use itertools.combinations but I got some errors.
One more question: how can I store my output in a DataFrame with the distances and the labels instead of a dict (dist = {})?
Thank you.
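
A minimal sketch of one way to fix this (an addition for clarity, not part of the original question): (x[i] - x_test[j])**2 is itself a length-4 array, one entry per feature, and cont is never reset between pairs, so each dictionary value ends up as a (4,) array. Summing over the feature axis before the square root gives a scalar per pair, and the pairs can be collected straight into a DataFrame. The function and column names below are illustrative:

import numpy as np
import pandas as pd

def distances_frame(x_train, x_test, y_train):
    # one row per (test, train) pair: Euclidean distance plus the train label
    rows = []
    for i in range(x_test.shape[0]):
        for j in range(x_train.shape[0]):
            # sum the squared differences over the 4 features -> a scalar
            d = np.sqrt(((x_test[i] - x_train[j]) ** 2).sum())
            rows.append((i, j, d, y_train[j]))
    return pd.DataFrame(rows, columns=['test_idx', 'train_idx', 'distance', 'label'])

dist_df = distances_frame(x_train, x_test, y_train)

For larger data, scipy.spatial.distance.cdist(x_test, x_train) computes the whole distance matrix in one vectorized call.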

Related

Accessing corresponding timeSeries data after Support Vector Regression

I am trying to train my data for forecasting with support vector regression. I exclude the time column before doing the regression because time is not an input, but I need the time data for plotting, and the plot should be in date order. When it comes to plotting, I have a problem getting the actual datetime index values: I need the time stamps corresponding to y_test and y_pred. When I try to use the original datetime index, the plot is not correct and not in date order with respect to the y series.
The output should be time (in order, e.g. from 01/01/2021 to 31/12/2021) vs. y_pred and y_test.
Here is my dataset: https://github.com/ozgurylc/Dataset
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
dataset = pd.read_csv('Combined_MET_PV_data.csv')
# take only the necessary columns
df = dataset[['referenceTime', 'dew_point_temp', 'air_temp', 'relative_humidity',
              'irradiance', 'wind_speed', 'wind_category',
              'hour_harmonic', 'AC_Power_IV2']]
print(df)
X = df.iloc[:, :-1].values # does not take Power
y = df.iloc[:, -1].values # only takes Power
print(X)
print(y)
print(X.shape, y.shape)
y = np.reshape(y, (-1,1))
print(y.shape)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
print("Train Shape: {} {} \nTest Shape: {} {}".format(X_train.shape, y_train.shape,
X_test.shape, y_test.shape))
X_train = X_train[:, 1:] # excludes referenceTime from X_train
X_test1 = X_test[:, 1:] # excludes referenceTime from X_test
print(X_test[:, 0].tolist()) # this keeps referenceTime
print(X_test)
Here is where regression is done:
# scaling
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test1 = sc_X.transform(X_test1)
sc_y = StandardScaler()
y_train = sc_y.fit_transform(y_train)
y_test = sc_y.transform(y_test)
y_train = y_train.reshape((-1,))
svr_linear = svm.SVR(kernel='rbf')
svr_linear.fit(X_train, y_train)
print(svr_linear.score(X_test1, y_test))
y_pred = svr_linear.predict(X_test1)
print(y_pred)
# in the following code X_test[:,0] where time index is kept.
plot_1 = plt.plot(X_test[:, 0], y_test, color='red', linewidth=2)
plot_2 = plt.plot(X_test[:, 0], y_pred, color='blue', linewidth=2, linestyle='--')
plt.show()
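One way to get the plot into date order (a sketch, not from the original post; it assumes the referenceTime strings are parseable by pd.to_datetime): train_test_split shuffles the rows, so sort the test rows by their time stamps before plotting:

# referenceTime is still in column 0 of X_test; sort the test set by it
times = pd.to_datetime(X_test[:, 0])
order = np.argsort(times)  # indices that put the time stamps in date order
plt.plot(times[order], y_test[order], color='red', linewidth=2, label='actual')
plt.plot(times[order], y_pred[order], color='blue', linewidth=2, linestyle='--', label='predicted')
plt.legend()
plt.show()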

Linear Regressor unable to predict a set of values; Error: ValueError: shapes (100,1) and (2,1) not aligned: 1 (dim 1) != 2 (dim 0)

I have 2 numpy arrays:
x= np.linspace(1,10,100) + np.random.randn(n)/5
y = np.sin(x)+x/6 + np.random.randn(n)/10
I want to train a linear regressor using these datasets. To compare the relationship between complexity and generalization, I am using PolynomialFeatures preprocessing for a set of four degrees (1, 3, 6, 9).
After fitting the model, I want to test it on the array x = np.linspace(1, 10, 100).
After much trying, I figured out that the x and y arrays need to be reshaped, and I did that. However, when I create the new x dataset to be predicted, it complains that the dimensions are not aligned. The estimator works fine on the test split from the original x array.
Below is my code
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
np.random.seed(0)
n = 100
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+x/6 + np.random.randn(n)/10
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)
def fn_one():
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures
    x_predict = np.linspace(0,10,100)
    x_predict = x_predict.reshape(-1, 1)
    degrees = [1, 3, 6, 9]
    predictions = []
    for i, deg in enumerate(degrees):
        linReg = LinearRegression()
        pf = PolynomialFeatures(degree=deg)
        xt = x.reshape(-1, 1)
        yt = y.reshape(-1, 1)
        X_transformed = pf.fit_transform(xt)
        X_train_transformed, X_test_transformed, y_train_temp, y_test_temp = train_test_split(X_transformed, yt, random_state=0)
        linReg.fit(X_train_transformed, y_train_temp)
        predictions.append(linReg.predict(x_predict))
    np.array(predictions)
    return predictions
The shapes of the different arrays (degree 3 in the loop):
x_predict = (100, 1)
xt = (100, 1)
yt = (100, 1)
X_train_transformed = (75, 4)
y_train_temp = (75, 1)
X_test_transformed = (25, 4)
y_test_temp = (25, 1)
predictions for X_test_transformed = (4, 25, 1)
predictions for x_predict = not working:
Error = ValueError: shapes (100,1) and (2,1) not aligned: 1 (dim 1) != 2 (dim 0)
You forgot to transform your x_predict. I have updated your code below:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
np.random.seed(0)
n = 100
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+x/6 + np.random.randn(n)/10
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)
def fn_one():
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures
    x_predict = np.linspace(0,10,100)
    x_predict = x_predict.reshape(-1, 1)
    degrees = [1, 3, 6, 9]
    predictions = []
    for i, deg in enumerate(degrees):
        linReg = LinearRegression()
        pf = PolynomialFeatures(degree=deg)
        xt = x.reshape(-1, 1)
        yt = y.reshape(-1, 1)
        X_transformed = pf.fit_transform(xt)
        X_train_transformed, X_test_transformed, y_train_temp, y_test_temp = train_test_split(X_transformed, yt, random_state=0)
        linReg.fit(X_train_transformed, y_train_temp)
        x_predict_transformed = pf.transform(x_predict)  # pf was already fitted on xt above
        predictions.append(linReg.predict(x_predict_transformed))
    np.array(predictions)
    return predictions
And now when you call fn_one() you will get the predictions.
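As a quick sanity check (an illustrative addition): each entry of the returned list has shape (100, 1), so the four per-degree predictions can be stacked into the usual (4, 100) array:

predictions = fn_one()
result = np.array(predictions).reshape(4, 100)  # one row per degree 1, 3, 6, 9
print(result.shape)  # (4, 100)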

Regression with Python (numpy/pandas) [duplicate]

I have the following variables:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
np.random.seed(0)
n = 15
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+x/6 + np.random.randn(n)/10
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)
def part1_scatter():
    %matplotlib notebook
    plt.figure()
    plt.scatter(X_train, y_train, label='training data')
    plt.scatter(X_test, y_test, label='test data')
    plt.legend(loc=4);
And the following question:
Write a function that fits a polynomial LinearRegression model on the training data X_train for degrees 1, 3, 6, and 9. (Use PolynomialFeatures in sklearn.preprocessing to create the polynomial features and then fit a linear regression model) For each model, find 100 predicted values over the interval x = 0 to 10 (e.g. np.linspace(0,10,100)) and store this in a numpy array. The first row of this array should correspond to the output from the model trained on degree 1, the second row degree 3, the third row degree 6, and the fourth row degree 9.
This is my code, but it doesn't work:
def answer_one():
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures
    np.random.seed(0)
n = 15
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+x/6 + np.random.randn(n)/10
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)
results = []
pred_data = np.linspace(0,10,100)
degree = [1,3,6,9]
y_train1 = y_train.reshape(-1,1)
    for i in degree:
        poly = PolynomialFeatures(degree=i)
        pred_poly1 = poly.fit_transform(pred_data[:,np.newaxis])
        X_F1_poly = poly.fit_transform(X_train[:,np.newaxis])
        linreg = LinearRegression().fit(X_F1_poly, y_train1)
        pred = linreg.predict(pred_poly1)
        results.append(pred)
        dataArray = np.array(results).reshape(4, 100)
        return dataArray
I receive this error:
line 58
    for i in degree:
    ^
IndentationError: unexpected indent
Could you tell me where the problem is?
The return statement should run only after the for loop has finished, so it should be indented to the same level as the for itself, not further in.
At the line
n = 15
you stopped indenting, so from that point on the code is no longer recognized as part of the function. This can be solved by indenting all lines from n = 15 onwards by 4 spaces.
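Putting both fixes together, a corrected version looks like this (a sketch combining the two answers: the body is indented consistently, and the return is dedented so it runs only after the loop finishes):

def answer_one():
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures
    np.random.seed(0)
    n = 15
    x = np.linspace(0, 10, n) + np.random.randn(n)/5
    y = np.sin(x) + x/6 + np.random.randn(n)/10
    X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)
    results = []
    pred_data = np.linspace(0, 10, 100)
    y_train1 = y_train.reshape(-1, 1)
    for i in [1, 3, 6, 9]:
        poly = PolynomialFeatures(degree=i)
        pred_poly1 = poly.fit_transform(pred_data[:, np.newaxis])
        X_F1_poly = poly.fit_transform(X_train[:, np.newaxis])
        linreg = LinearRegression().fit(X_F1_poly, y_train1)
        results.append(linreg.predict(pred_poly1))
    # return after the loop, at the loop's indentation level
    return np.array(results).reshape(4, 100)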

How to solve "x and y must have same first dimension"?

from sklearn.neighbors import KNeighborsClassifier
import pandas as pd
from sklearn import metrics
from sklearn.cross_validation import train_test_split
import matplotlib.pyplot as plt
r = pd.read_csv("vitalsign_test.csv")
clm_list = []
for column in r.columns:
    clm_list.append(column)
X = r[clm_list[1:len(clm_list)-1]].values
y = r[clm_list[len(clm_list)-1]].values
X_train, X_test, y_train, y_test = train_test_split (X,y, test_size = 0.3, random_state=4)
k_range = range(1,25)
scores = []
for k in k_range:
    clf = KNeighborsClassifier(n_neighbors = k)
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
scores.append(metrics.accuracy_score(y_test,y_pred))
plt.plot(k_range,scores)
plt.xlabel('value of k for clf')
plt.ylabel('testing accuracy')
The response that I am getting is:
ValueError: x and y must have same first dimension
My feature and response shapes are:
y.shape
Out[60]: (500,)
X.shape
Out[61]: (500, 6)
It has nothing to do with your X and y; it is about the x and y arguments to plot, since your scores has only one element while k_range has 24. The cause is incorrect indentation:
for k in k_range:
    clf = KNeighborsClassifier(n_neighbors = k)
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
scores.append(metrics.accuracy_score(y_test,y_pred))
should be
for k in k_range:
    clf = KNeighborsClassifier(n_neighbors = k)
    clf.fit(X_train,y_train)
    y_pred = clf.predict(X_test)
    scores.append(metrics.accuracy_score(y_test,y_pred))
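With the loop body fully indented, scores collects one accuracy per k, so both plt.plot arguments have 24 elements (range(1, 25) yields 24 values) and the dimension error goes away.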

In SelectKBest, what does length of get_support() represent?

When reproducing this cross-validation example, I get for a 2x4 train matrix (xtrain) a len(b.get_support()) of 1,000,000. Does this mean 1,000,000 features have been created in the model? Or only 2, as only two features actually have an impact? Thanks!
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.cross_validation import cross_val_score, KFold
from sklearn.linear_model import LinearRegression
### create data
def hidden_model(x):
    # y is a linear combination of columns 5 and 10...
    result = x[:, 5] + x[:, 10]
    # ... with a little noise
    result += np.random.normal(0, .005, result.shape)
    return result
def make_x(nobs):
    return np.random.uniform(0, 3, (nobs, 10 ** 6))
x = make_x(20)
y = hidden_model(x)
scores = []
clf = LinearRegression()
for train, test in KFold(len(y), n_folds=5):
    xtrain, xtest, ytrain, ytest = x[train], x[test], y[train], y[test]
    b = SelectKBest(f_regression, k=2)
    b.fit(xtrain, ytrain)
    xtrain = xtrain[:, b.get_support()]  # get_support: get mask or integer index of selected features
    xtest = xtest[:, b.get_support()]
    print len(b.get_support())
    clf.fit(xtrain, ytrain)
    scores.append(clf.score(xtest, ytest))
    yp = clf.predict(xtest)
    plt.plot(yp, ytest, 'o')
    plt.plot(ytest, ytest, 'r-')
plt.xlabel('Predicted')
plt.ylabel('Observed')
print("CV Score (R_square) is", np.mean(scores))
It represents the boolean mask that can be applied to your x to select the features chosen by the SelectKBest routine.
print x.shape
print b.get_support().shape
print np.bincount(b.get_support())
Outputs:
(20, 1000000)
(1000000,)
[999998 2]
This shows that you have 20 examples of 1,000,000-dimensional data and a boolean array of length 1,000,000, of which only two entries are True.
Hope that helps!
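If you want the two selected column indices themselves rather than the mask (a small illustrative addition, not part of the original answer), get_support also accepts an indices flag:

idx = b.get_support(indices=True)  # integer indices of the k=2 selected columns
print(idx)  # with this data these should be columns 5 and 10 (up to noise)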
