Predicting y and x values using linear regressions - python

I am writing a program that predicts x and y values using linear regression.
I can predict y from x. However, when I try to predict x from a given y, I do not get the intended result. Output:
Given (x) predict (y):
x = 10
85.59308314937454
Given (y) predict (x):
y = 85
-45.75349521707133
Code:
def place_y(x, slope, intercept):
    return slope * x + intercept
def predict_value_x():
    """Using the line of regression a value can be predicted based on a given value.
    i.e. Predict the speed of a car (y) given it is (x) years old"""
    from scipy import stats
    age_x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6] # population
    speed_y = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86] # population
    slope, intercept, r, p, std_err = stats.linregress(age_x, speed_y) # get stats values
    predict_value = int(input("Given (x) predict (y): \nx = ")) # age of car(x)
    predicted = place_y(predict_value, slope, intercept) # the speed of car given x
    print(predicted)
predict_value_x()
def predict_value_y():
    """Using the line of regression a value can be predicted based on a given value.
    i.e. Predict the age of a car (x) given its speed (y)"""
    from scipy import stats
    age_x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6] # population
    speed_y = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86] # population
    slope, intercept, r, p, std_err = stats.linregress(age_x, speed_y) # get stats values
    predict_value = int(input("Given (y) predict (x): \ny = ")) # age of car(x)
    predicted = place_y(predict_value, slope, intercept) # the speed of car given x
    print(predicted)

Rearranging y = ax + b for x gives x = (y - b)/a.
The problem is that you solve for y in both functions.
You need an additional function that solves for x:
def place_x(y, slope, intercept):
    return (y - intercept) / slope
Then use it instead of place_y in your predict_value_y function:
predicted = place_x(predict_value, slope, intercept)
The entire code could look like this:
def place_y(x, slope, intercept):
    return slope * x + intercept
def place_x(y, slope, intercept):
    return (y - intercept) / slope
def predict_value_x():
    """Using the line of regression a value can be predicted based on a given value.
    i.e. Predict the speed of a car (y) given it is (x) years old"""
    from scipy import stats
    age_x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6] # population
    speed_y = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86] # population
    slope, intercept, r, p, std_err = stats.linregress(age_x, speed_y) # get stats values
    predict_value = int(input("Given (x) predict (y): \nx = ")) # age of car (x)
    predicted = place_y(predict_value, slope, intercept) # the speed of car given x
    print(predicted)
predict_value_x()
def predict_value_y():
    """Using the line of regression a value can be predicted based on a given value.
    i.e. Predict the age of a car (x) given its speed (y)"""
    from scipy import stats
    age_x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6] # population
    speed_y = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86] # population
    slope, intercept, r, p, std_err = stats.linregress(age_x, speed_y) # get stats values
    predict_value = int(input("Given (y) predict (x): \ny = ")) # speed of car (y)
    predicted = place_x(predict_value, slope, intercept) # the age of car given y
    print(predicted)
predict_value_y()

The issue is that place_y is intended to predict y from x, but you are using it to predict x from y. It calculates y = slope * x + intercept, which cannot return the correct result when the input is a y value. To predict x from y, you need to solve the equation y = slope * x + intercept for x: x = (y - intercept) / slope. In the predict_value_y function, update the line where you calculate predicted:
predicted = (predict_value - intercept) / slope
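As a quick check of the rearrangement, here is a minimal sketch that predicts y from x and then maps that y back to x. To stay dependency-free it uses a small hand-written helper, fit_line (a hypothetical name, not part of the original code), in place of stats.linregress; it is assumed to give the same slope and intercept for this data.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def place_y(x, slope, intercept):
    return slope * x + intercept

def place_x(y, slope, intercept):
    return (y - intercept) / slope

age_x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6]
speed_y = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

slope, intercept = fit_line(age_x, speed_y)
y_at_10 = place_y(10, slope, intercept)      # speed predicted for a 10-year-old car
x_back = place_x(y_at_10, slope, intercept)  # should recover the age we started from
print(y_at_10, x_back)
```

Feeding the predicted speed back through place_x returns the original age, which is the behaviour the question expected from predict_value_y.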

Related

Drawing line of regression onto scatter graph in python

I am trying to draw the line of regression onto a scatter graph. I have two functions:
def place_y(x, slope, intercept):
    return slope * x + intercept
def draw_line_of_regression():
    """The line of regression can be used to predict further values"""
    import matplotlib.pyplot as plt # used to draw graph
    from scipy import stats
    # Example shows relationship between age and speed
    age_x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6]
    speed_y = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]
    slope, intercept, r, p, std_error = stats.linregress(age_x, speed_y)
    # gets variables used in drawing the line of regression
    line_of_regression = list(map(place_y(age_x, slope, intercept), age_x))
    plt.scatter(age_x, speed_y) # draws scatter graph
    plt.plot(age_x, line_of_regression)
    plt.show() # shows the graph
draw_line_of_regression()
When this is run there is an error with the place_y() function. Error:
    return slope * x + intercept
TypeError: can't multiply sequence by non-int of type 'numpy.float64'
map() expects a function as its first argument, but you are passing it the result of calling place_y(age_x, slope, intercept). That call raises the error at execution time because multiplying a list by a float is not defined. You have to pass the function itself, but "freeze" all arguments except x. To do that you can use functools.partial:
import functools
...
line_of_regression = list(map(functools.partial(place_y, slope=slope, intercept=intercept), age_x))
...
However, a better way to do the same is to use a list comprehension:
...
line_of_regression = [place_y(x, slope, intercept) for x in age_x]
...
Even better is to leverage NumPy's vectorized operations:
import numpy as np
...
age_x = np.array([5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6])
...
line_of_regression = age_x * slope + intercept
...
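To make the difference between passing a function and passing a function call concrete, here is a small sketch comparing the functools.partial version with the list comprehension. The slope and intercept values are illustrative placeholders, not fitted values.

```python
import functools

def place_y(x, slope, intercept):
    return slope * x + intercept

age_x = [5, 7, 8]
slope, intercept = -1.75, 103.1  # illustrative values only, not a real fit

# partial() returns a new one-argument function; map() then calls it once per item.
line_at = functools.partial(place_y, slope=slope, intercept=intercept)
via_map = list(map(line_at, age_x))

# The comprehension performs the same per-item calls without the helper.
via_comprehension = [place_y(x, slope, intercept) for x in age_x]

print(via_map == via_comprehension)
```

Both forms call place_y once per element, so they produce identical lists; the comprehension is usually the easier one to read.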

How to perform unsupervised clustering on numbers in an array using PyTorch

I have this array and I want to cluster/group the numbers into sets of similar values.
An example input array:
array([ 57, 58, 59, 60, 61, 78, 79, 80, 81, 82, 83, 101, 102, 103, 104, 105, 106])
Expected result:
array([57,58,59,60,61]), ([78,79,80,81,82,83]), ([101,102,103,104,105,106])
I tried to use clustering, but I don't think it will work if I don't know in advance how many groups the data should be split into.
true = np.where(array>=1)
-> (array([ 57, 58, 59, 60, 61, 78, 79, 80, 81, 82, 83, 101, 102,
103, 104, 105, 106], dtype=int64),)
Dynamic binning requires explicit criteria and is not an easy problem to automate, because each array may require a different set of thresholds to bin it efficiently.
I think Gaussian mixtures with a silhouette score criterion are the best bet you have. Here is code for what you are trying to achieve. The silhouette scores help you determine the number of clusters/Gaussians you should use, and they are quite accurate and interpretable for 1D data.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score
import scipy.stats
import matplotlib.pyplot as plt
# %matplotlib inline  (only needed in a Jupyter notebook)
# Sample data
x = [57, 58, 59, 60, 61, 78, 79, 80, 81, 82, 83, 101, 102, 103, 104, 105, 106]
# Reshape the data into a single column, as scikit-learn expects
data = np.array(x).reshape(-1,1)
# Try several cluster counts and record the silhouette score of each
print('Silhouette scores')
scores = []
for n in range(2,11):
    model = GaussianMixture(n).fit(data)
    preds = model.predict(data)
    score = silhouette_score(data, preds)
    scores.append(score)
    print(n,'->',score)
n_best = np.argmax(scores)+2 # because cluster counts start from 2
model = GaussianMixture(n_best).fit(data) # best model fit
# Get the means and standard deviations of the fitted Gaussians
mu = np.abs(model.means_.flatten())
sd = np.sqrt(np.abs(model.covariances_.flatten()))
# Plotting
extend_window = 50 # widens the plotted x range; the higher it is, the more zoomed out
x_values = np.arange(data.min()-extend_window, data.max()+extend_window, 0.1) # for plotting smooth graphs
plt.plot(data, np.zeros(data.shape), linestyle='None', markersize = 10.0, marker='o') # plot the data on the x axis
# Plot the fitted distributions (one per cluster)
for i in range(n_best):
    y_values = scipy.stats.norm(mu[i], sd[i])
    plt.plot(x_values, y_values.pdf(x_values))
# Split data by clusters (this relies on x being sorted)
pred = model.predict(data)
output = np.split(x, np.sort(np.unique(pred, return_index=True)[1])[1:])
print(output)
Output:
Silhouette scores
2 -> 0.699444729378163
3 -> 0.8962176943475543 #<--- selected as n_best
4 -> 0.7602523591781903
5 -> 0.5835620702692205
6 -> 0.5313888070615105
7 -> 0.4457049486461251
8 -> 0.4355742296918767
9 -> 0.13725490196078433
10 -> 0.2159663865546218
This fits 3 Gaussians, which split the data into clusters.
The final output, split into arrays of similar values:
[array([57, 58, 59, 60, 61]),
array([78, 79, 80, 81, 82, 83]),
array([101, 102, 103, 104, 105, 106])]
You can take a discrete derivative of this array so that you can track changes more easily. Assume your array is:
A = np.array([ 57, 58, 59, 60, 61, 78, 79, 80, 81, 82, 83, 101, 102, 103, 104, 105, 106])
You can build the difference vector by convolving your vector with [-1, 1]:
A_ = abs(np.convolve(A, np.array([-1, 1])))
Then A_ (one element longer than A, because of the full convolution) is:
array([57, 1, 1, 1, 1, 17, 1, 1, 1, 1, 1, 18, 1, 1, 1, 1, 1, 106])
Now you can define a threshold such as 5 and find the cluster boundaries:
THRESHOLD = 5
cluster_bounds = np.argwhere(A_ > THRESHOLD)
Now cluster_bounds is:
array([[0], [5], [11], [17]], dtype=int32)
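The same thresholding idea can be sketched without NumPy, going straight from the gaps to the grouped values. The helper name split_by_gap is hypothetical, and the sketch assumes the input is already sorted, as in the question.

```python
def split_by_gap(values, threshold):
    """Split sorted values into runs, starting a new run when a gap exceeds threshold."""
    if not values:
        return []
    clusters = [[values[0]]]
    for previous, current in zip(values, values[1:]):
        if current - previous > threshold:
            clusters.append([current])    # large jump: start a new cluster
        else:
            clusters[-1].append(current)  # small step: stay in the current cluster
    return clusters

A = [57, 58, 59, 60, 61, 78, 79, 80, 81, 82, 83, 101, 102, 103, 104, 105, 106]
for cluster in split_by_gap(A, threshold=5):
    print(cluster)
```

The threshold still has to be chosen by hand, so this trades the automatic cluster count of the mixture approach for simplicity.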

Find proportional sampling using Python

I'm given a problem that explicitly asks me not to use numpy or pandas.
Problem: Select an element from the list A randomly with probability proportional to its magnitude. Assume we repeat the same experiment 100 times with replacement; in each experiment, print the number that is selected randomly from A.
Ex 1: A = [0 5 27 6 13 28 100 45 10 79]
Let f(x) denote the number of times x is selected in 100 experiments.
f(100) > f(79) > f(45) > f(28) > f(27) > f(13) > f(10) > f(6) > f(5) > f(0)
Initially, I took the sum of all the elements of list A.
I then divided each element of list A by the sum (in order to normalize) and stored each of these values in another list (d_dash).
I then created another empty list (d_bar) that holds the cumulative sum of all elements of d_dash.
I created a variable r, where r = random.uniform(0.0,1.0), and then, for the length of d_dash, compared r to d_dash[k]; if r<=d_dash[k], return A[k].
However, I'm getting the error "list index out of range" near d_dash[j].append((A[j]/sum)). I'm not sure what the issue is, as I did not exceed the index of either d_dash or A[j].
Also, is my logic correct? Sharing a better way to do this would be appreciated.
Thanks in advance.
import random
A = [0,5,27,6,13,28,100,45,10,79]
def propotional_sampling(A):
    sum=0
    for i in range(len(A)):
        sum = sum + A[i]
    d_dash=[]
    for j in range(len(A)):
        d_dash[j].append((A[j]/sum))
    #cumulative sum
    d_bar =[]
    d_bar[0]= 0
    for k in range(len(A)):
        d_bar[k] = d_bar[k] + d_dash[k]
    r = random.uniform(0.0,1.0)
    number=0
    for p in range(len(d_bar)):
        if(r<=d_bar[p]):
            number=d_bar[p]
    return number
def sampling_based_on_magnitued():
    for i in range(1,100):
        number = propotional_sampling(A)
        print(number)
sampling_based_on_magnitued()
The code below does the same:
import random
A = [0, 5, 27, 6, 13, 28, 100, 45, 10, 79]
# Sum of all the elements in the list
S = sum(A)
# Normalize each element by the sum
norm_sum = [ele/S for ele in A]
# Cumulative sum of the normalized values
cum_norm_sum = []
cum_norm_sum.append(norm_sum[0])
for itr in range(1, len(norm_sum)):
    cum_norm_sum.append(cum_norm_sum[-1] + norm_sum[itr])
def prop_sampling(cum_norm_sum):
    """
    This function returns an element
    with proportional sampling.
    """
    r = random.random()
    for itr in range(len(cum_norm_sum)):
        if r < cum_norm_sum[itr]:
            return A[itr]
    return A[-1]  # guard against floating-point rounding in the last cumulative value
# Sampling 1000 elements from the given list with proportional sampling
sampled_elements = []
for itr in range(1000):
    sampled_elements.append(prop_sampling(cum_norm_sum))
Plotting the frequency of each element in the sampled points shows that the number of times each element appears is proportional to its magnitude.
The cumulative sum can be computed with itertools.accumulate. The loop:
for p in range(len(d_bar)):
    if(r<=d_bar[p]):
        number=d_bar[p]
can be replaced by bisect.bisect():
import random
from itertools import accumulate
from bisect import bisect
A = [0,5,27,6,13,28,100,45,10,79]
def propotional_sampling(A, n=100):
    # calculate cumulative sum from A:
    cum_sum = [*accumulate(A)]
    # cum_sum = [0, 5, 32, 38, 51, 79, 179, 224, 234, 313]
    out = []
    for _ in range(n):
        i = random.random() # i is in [0.0, 1.0)
        idx = bisect(cum_sum, i*cum_sum[-1]) # get index to list A
        out.append(A[idx])
    return out
print(propotional_sampling(A))
Prints (for example):
[10, 100, 100, 79, 28, 45, 45, 27, 79, 79, 79, 79, 100, 27, 100, 100, 100, 13, 45, 100, 5, 100, 45, 79, 100, 28, 79, 79, 6, 45, 27, 28, 27, 79, 100, 79, 79, 28, 100, 79, 45, 100, 10, 28, 28, 13, 79, 79, 79, 79, 28, 45, 45, 100, 28, 27, 79, 27, 45, 79, 45, 100, 28, 100, 100, 5, 100, 79, 28, 79, 13, 100, 100, 79, 28, 100, 79, 13, 27, 100, 28, 10, 27, 28, 100, 45, 79, 100, 100, 100, 28, 79, 100, 45, 28, 79, 79, 5, 45, 28]
The reason you got the "list index out of range" message is that you index into empty lists. d_dash[j].append(...) fails because d_dash = [] has no element at index j yet, and the same happens later when you create "d_bar = []" and then assign to it with "d_bar[k] = d_bar[k] + d_dash[k]". I recommend using the following structure instead.
First, define the list this way:
d_bar=[0 for i in range(len(A))]
Also, I believe this code will return 1 every time, as there is no break in the loop. You can resolve this issue by adding "break". Here is an updated version of your code:
import random
A = [0, 5, 27, 6, 13, 28, 100, 45, 10, 79]
def pick_a_number_from_list(A):
    total = 0  # avoid shadowing the built-in sum()
    for i in A:
        total += i
    A_norm = []
    for j in A:
        A_norm.append(j/total)
    # cumulative sum of the normalized values
    A_cum = [0 for i in range(len(A))]
    A_cum[0] = A_norm[0]
    for k in range(len(A_norm)-1):
        A_cum[k+1] = A_cum[k] + A_norm[k+1]
    r = random.uniform(0.0,1.0)
    number = A[-1]  # fallback if rounding leaves r above the last cumulative value
    for p in range(len(A_cum)):
        if(r<=A_cum[p]):
            number = A[p]
            break
    return number
def sampling_based_on_magnitued():
    for i in range(100):  # 100 experiments
        number = pick_a_number_from_list(A)
        print(number)
sampling_based_on_magnitued()
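Since the question also asks for a better way, here is a minimal sketch using random.choices from the standard library, which accepts per-item weights and so stays within the "no numpy or pandas" constraint. The function name sample_proportional and the fixed seed are assumptions made for illustration.

```python
import random
from collections import Counter

A = [0, 5, 27, 6, 13, 28, 100, 45, 10, 79]

def sample_proportional(values, k, rng=random):
    """Draw k items with replacement, each with probability value / sum(values)."""
    return rng.choices(values, weights=values, k=k)

rng = random.Random(0)  # fixed seed only so that the run is repeatable
counts = Counter(sample_proportional(A, 10_000, rng))

# Larger values should show up more often; 0 has zero weight and is never drawn.
for value in sorted(A, reverse=True):
    print(value, counts[value])
```

With only 100 draws, as in the original exercise, neighbouring values such as 27 and 28 will often swap places in the frequency ranking, so the strict ordering of f(x) holds only on average.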

How to weight stations in Ordinary Least Squares in Python?

I have precipitation data from 10 climate stations, along with their DEM values.
I did a linear regression as follows:
from scipy import stats
DEM = [200, 300, 400, 500, 600, 300, 200, 100, 50, 200]
Prep = [50, 95, 50, 59, 99, 50, 23, 10, 10, 60]
X = DEM # independent variable
Y = Prep # dependent variable
slope, intercept, r_value, p_value, std_err = stats.linregress(X, Y)
But now I want to add weights to those stations, like:
Weight = [0.3, 0.1, 0.1, 0.1, 0.2, 0.05, 0.05, 0.05, 0.05, 0.05]
The diagram is like http://ppt.cc/XXrEv
I found Weighted Least Squares for this, but I want to know how and why it works, or whether it is wrong.
import numpy as np
import statsmodels.api as sm
Y = [1, 3, 4, 5, 2, 3, 4]
X = range(1, 8)
X = sm.add_constant(X)
wls_model = sm.WLS(Y, X, weights=range(1, 8))
results = wls_model.fit()
results.params
Answer:
import statsmodels.api as sm
alist = [2, 4, 6]  # 1-based indices of the stations to use
DEM = [200,300,400,500,300,600]
PRE = [20,19,18,20,21,22,30,23]
A_DEM = []
A_PRE = []
W = []
for a in alist:
    A_DEM.append(DEM[a-1])
    A_PRE.append(PRE[a-1])
    W.append(1)
X = sm.add_constant(A_DEM)
Y = A_PRE
wls_model = sm.WLS(Y, X, weights=W).fit()
print(wls_model.params[0])  # intercept
print(wls_model.params[1])  # slope
print(wls_model.rsquared)  # R-squared
print(wls_model.summary())
I also found that WLS normalizes the weights automatically, so you can pass the weights in directly.
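On the "how and why" part of the question: for a straight line, weighted least squares picks the slope and intercept that minimize the sum of w * (y - slope * x - intercept)**2, so stations with larger weights pull the line harder. The sketch below is a plain-Python illustration of that closed form using the question's data. The helper weighted_line_fit is a hypothetical name and is not how statsmodels implements WLS internally.

```python
def weighted_line_fit(xs, ys, ws):
    """Slope and intercept minimizing sum(w * (y - slope * x - intercept) ** 2)."""
    total_w = sum(ws)
    mean_x = sum(w * x for w, x in zip(ws, xs)) / total_w  # weighted mean of x
    mean_y = sum(w * y for w, y in zip(ws, ys)) / total_w  # weighted mean of y
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(ws, xs, ys))
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(ws, xs))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

DEM = [200, 300, 400, 500, 600, 300, 200, 100, 50, 200]
Prep = [50, 95, 50, 59, 99, 50, 23, 10, 10, 60]
Weight = [0.3, 0.1, 0.1, 0.1, 0.2, 0.05, 0.05, 0.05, 0.05, 0.05]

fit = weighted_line_fit(DEM, Prep, Weight)
fit_scaled = weighted_line_fit(DEM, Prep, [10 * w for w in Weight])
print(fit)
print(fit_scaled)
```

Multiplying every weight by the same constant scales the numerator and denominator equally, so the fitted coefficients do not change. That is why only the relative sizes of the weights matter and they can be passed in without normalizing them first.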

Python: Finding a trend in a set of numbers

I have a list of numbers in Python, like this:
x = [12, 34, 29, 38, 34, 51, 29, 34, 47, 34, 55, 94, 68, 81]
What's the best way to find the trend in these numbers? I'm not interested in predicting what the next number will be, I just want to output the trend for many sets of numbers so that I can compare the trends.
Edit: By trend, I mean that I'd like a numerical representation of whether the numbers are increasing or decreasing and at what rate. I'm not massively mathematical, so there's probably a proper name for this!
Edit 2: It looks like what I really want is the coefficient of the linear best fit. What's the best way to get this in Python?
Possibly you mean you want to plot these numbers on a graph and find a straight line through them where the overall distance between the line and the numbers is minimized? This is called a linear regression:
def linreg(X, Y):
    """
    Return a, b in the solution to y = ax + b such that the root mean square distance between the trend line and the original points is minimized.
    """
    N = len(X)
    Sx = Sy = Sxx = Syy = Sxy = 0.0
    for x, y in zip(X, Y):
        Sx = Sx + x
        Sy = Sy + y
        Sxx = Sxx + x*x
        Syy = Syy + y*y
        Sxy = Sxy + x*y
    det = Sxx * N - Sx * Sx
    return (Sxy * N - Sy * Sx)/det, (Sxx * Sy - Sx * Sxy)/det
x = [12, 34, 29, 38, 34, 51, 29, 34, 47, 34, 55, 94, 68, 81]
a, b = linreg(range(len(x)), x)  # your x, y are switched from standard notation
The trend line is unlikely to pass through your original points, but it will be as close to them as a straight line can get. Using the gradient and intercept values of this trend line (a, b) you will be able to extrapolate the line past the end of the array:
extrapolatedtrendline = [a*index + b for index in range(20)]  # replace 20 with desired trend length
The link provided by Keith, or the answer from Riaz, might help you get a polynomial fit, but it is generally better to use a library when one is available. For this problem, numpy provides a polynomial fit function called polyfit. You can use polyfit to fit the data to a polynomial of any degree.
Here is an example using numpy to fit the data to a linear equation of the form y=ax+b:
>>> import numpy as np
>>> data = [12, 34, 29, 38, 34, 51, 29, 34, 47, 34, 55, 94, 68, 81]
>>> x = np.arange(0,len(data))
>>> y=np.array(data)
>>> z = np.polyfit(x,y,1)
>>> print("{0}x + {1}".format(*z))
4.32527472527x + 17.6
>>>
Similarly, a quadratic fit would be:
>>> z = np.polyfit(x,y,2)
>>> print("{0}x^2 + {1}x + {2}".format(*z))
0.311126373626x^2 + 0.280631868132x + 25.6892857143
>>>
Here is one way to get an increasing/decreasing trend:
>>> x = [12, 34, 29, 38, 34, 51, 29, 34, 47, 34, 55, 94, 68, 81]
>>> trend = [b - a for a, b in zip(x[::1], x[1::1])]
>>> trend
[22, -5, 9, -4, 17, -22, 5, 13, -13, 21, 39, -26, 13]
In the resulting list trend, trend[0] can be interpreted as the increase from x[0] to x[1], trend[1] would be the increase from x[1] to x[2] etc. Negative values in trend mean that value in x decreased from one index to the next.
You could do a least squares fit of the data.
Using the formula from this page:
y = [12, 34, 29, 38, 34, 51, 29, 34, 47, 34, 55, 94, 68, 81]
N = len(y)
x = range(N)
B = (sum(x[i] * y[i] for i in range(N)) - 1./N*sum(x)*sum(y)) / (sum(x[i]**2 for i in range(N)) - 1./N*sum(x)**2)
A = 1.*sum(y)/N - B * 1.*sum(x)/N
print("%f + %f * x" % (A, B))
This prints the starting value and delta of the best-fit line.
I agree with Keith, I think you're probably looking for a linear least squares fit (if all you want to know is if the numbers are generally increasing or decreasing, and at what rate). The slope of the fit will tell you at what rate they're increasing. If you want a visual representation of a linear least squares fit, try Wolfram Alpha:
http://www.wolframalpha.com/input/?i=linear+fit+%5B12%2C+34%2C+29%2C+38%2C+34%2C+51%2C+29%2C+34%2C+47%2C+34%2C+55%2C+94%2C+68%2C+81%5D
Update: If you want to implement a linear regression in Python, I recommend starting with the explanation at Mathworld:
http://mathworld.wolfram.com/LeastSquaresFitting.html
It's a very straightforward explanation of the algorithm, and it practically writes itself. In particular, you want to pay close attention to equations 16-21, 27, and 28.
Try writing the algorithm yourself, and if you have problems, you should open another question.
You can find the OLS coefficients using numpy:
import numpy as np
y = [12, 34, 29, 38, 34, 51, 29, 34, 47, 34, 55, 94, 68, 81]
x = []
x.append(list(range(len(y))))  # time variable
x.append([1 for ele in range(len(y))])  # this adds the intercept
y = np.matrix(y).T
x = np.matrix(x).T
betas = ((x.T*x).I*x.T*y)
Results:
>>> betas
matrix([[ 4.32527473], #coefficient on the time variable
[ 17.6 ]]) #coefficient on the intercept
Since the coefficient on the trend variable is positive, observations in your variable are increasing over time.
You can simply use the SciPy library:
import numpy as np
from scipy.stats import linregress
data = [12, 34, 29, 38, 34, 51, 29, 34, 47, 34, 55, 94, 68, 81]
x = np.arange(1,len(data)+1)
y=np.array(data)
res = linregress(x, y)
print(f'Equation: {res[0]:.3f} * t + {res[1]:.3f}, R^2: {res[2] ** 2:.2f} ')
res
Output:
Equation: 4.325 * t + 13.275, R^2: 0.66
LinregressResult(slope=4.325274725274725, intercept=13.274725274725277, rvalue=0.8096297800892154, pvalue=0.0004497809466484867, stderr=0.9051717124425395, intercept_stderr=7.707259409345618)
Compute the beta coefficient:
y = [12, 34, 29, 38, 34, 51, 29, 34, 47, 34, 55, 94, 68, 81]
x = range(1,len(y)+1)
def var(X):
    S = 0.0
    SS = 0.0
    for x in X:
        S += x
        SS += x*x
    xbar = S/float(len(X))
    return (SS - len(X) * xbar * xbar) / (len(X) - 1.0)
def cov(X,Y):
    n = len(X)
    xbar = sum(X) / float(n)  # float division, so the means are not truncated under Python 2
    ybar = sum(Y) / float(n)
    return sum([(x-xbar)*(y-ybar) for x,y in zip(X,Y)]) / (n - 1.0)
def beta(x,y):
    return cov(x,y)/var(x)
print(beta(x,y))  # about 4.3253, the same slope as the least-squares fits above
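Because the stated goal is to compare trends across many sets of numbers, one option is to reduce each series to its least-squares slope against the index and rank the series by that value. The sketch below assumes evenly spaced observations. The helper trend_slope and the "falling" and "flat" series are hypothetical, added only for illustration.

```python
def trend_slope(values):
    """Least-squares slope of values against their indices 0, 1, 2, ..."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxy = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    return sxy / sxx

series = {
    "question": [12, 34, 29, 38, 34, 51, 29, 34, 47, 34, 55, 94, 68, 81],
    "falling": [90, 80, 85, 60, 50],  # made-up example data
    "flat": [5, 5, 5, 5],             # made-up example data
}

slopes = {name: trend_slope(vals) for name, vals in series.items()}
# Rank from the steepest increase to the steepest decrease.
for name, slope in sorted(slopes.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {slope:+.3f}")
```

The slope is in units of "value per step", so series on very different scales may need to be normalized (for example, divided by their mean) before the slopes are compared.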
