Drawing line of regression onto scatter graph in python - python

I am trying to draw the line of regression onto a scatter graph. I have two functions:
def place_y(x, slope, intercept):
    return slope * x + intercept
def draw_line_of_regression():
    """The line of regression can be used to predict further values"""
    import matplotlib.pyplot as plt  # used to draw graph
    from scipy import stats
    # Example shows relationship between age and speed
    age_x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6]
    speed_y = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]
    slope, intercept, r, p, std_error = stats.linregress(age_x, speed_y)
    # gets variables used in drawing the line of regression
    line_of_regression = list(map(place_y(age_x, slope, intercept), age_x))
    plt.scatter(age_x, speed_y)  # draws scatter graph
    plt.plot(age_x, line_of_regression)
    plt.show()  # shows the graph
draw_line_of_regression()
Running this raises an error in the place_y() function:
    return slope * x + intercept
TypeError: can't multiply sequence by non-int of type 'numpy.float64'

map() expects a function as its first argument, but here it receives place_y(age_x, slope, intercept), which is the result of calling the function. That call fails before map() even runs, because age_x is a list and multiplying a list by a float is not defined. You have to pass the function itself, with every argument except x "frozen". functools.partial does that:
import functools
...
line_of_regression = list(map(functools.partial(place_y, slope=slope, intercept=intercept), age_x))
...
However, a better way to do the same thing is to use a list comprehension:
...
line_of_regression = [place_y(x, slope, intercept) for x in age_x]
...
Even better is to leverage numpy's vectorized operations:
import numpy as np
...
age_x = np.array([5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6])
...
line_of_regression = age_x * slope + intercept
...
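As a minimal sketch of the first two fixes, the snippet below reproduces the failure and then shows that functools.partial with map() and the list comprehension give the same values. The slope and intercept here are hypothetical numbers chosen for illustration, not the values linregress returns for the question's data.

```python
from functools import partial

def place_y(x, slope, intercept):
    return slope * x + intercept

# Hypothetical coefficients, chosen only for illustration.
slope, intercept = -2.0, 100.0
age_x = [5, 7, 8, 2]

# Passing the whole list reproduces the error: list * float is not defined.
raised = False
try:
    place_y(age_x, slope, intercept)
except TypeError as exc:
    raised = True
    print("TypeError:", exc)

# partial() returns a one-argument function with slope and intercept fixed.
predict = partial(place_y, slope=slope, intercept=intercept)
via_map = list(map(predict, age_x))
via_comprehension = [place_y(x, slope, intercept) for x in age_x]

print(via_map)                       # [90.0, 86.0, 84.0, 96.0]
print(via_map == via_comprehension)  # True
```

Both forms call place_y once per element, so the choice between them is mostly a matter of readability.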

Related

Predicting y and x values using linear regressions

I am making a program to predict the x and y value using linear regression.
I can predict y from x. However, when I try to predict x given y, I do not get the intended result. Output:
Given (x) predict (y):
x = 10
85.59308314937454
Given (y) predict (x):
y = 85
-45.75349521707133
code:
def place_y(x, slope, intercept):
    return slope * x + intercept
def predict_value_x():
    """Using the line of regression a value can be predicted based on a given value.
    i.e. Predict the speed of a car (y) given it is (x) years old"""
    from scipy import stats
    age_x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6]  # population
    speed_y = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]  # population
    slope, intercept, r, p, std_err = stats.linregress(age_x, speed_y)  # get stats values
    predict_value = int(input("Given (x) predict (y): \nx = "))  # age of car(x)
    predicted = place_y(predict_value, slope, intercept)  # the speed of car given x
    print(predicted)
predict_value_x()
def predict_value_y():
    """Using the line of regression a value can be predicted based on a given value.
    i.e. Predict the age of a car (x) given its speed (y)"""
    from scipy import stats
    age_x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6]  # population
    speed_y = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]  # population
    slope, intercept, r, p, std_err = stats.linregress(age_x, speed_y)  # get stats values
    predict_value = int(input("Given (y) predict (x): \ny = "))  # age of car(x)
    predicted = place_y(predict_value, slope, intercept)  # the speed of car given x
    print(predicted)
y = ax + b  ->  x = (y - b) / a
The problem is that you solve for y both times.
You need an additional function that solves for x:
def place_x(y, slope, intercept):
    return (y - intercept) / slope
and use it in place of place_y in your predict_value_y function:
predicted = place_x(predict_value, slope, intercept)
The entire code could look like:
def place_y(x, slope, intercept):
    return slope * x + intercept
def place_x(y, slope, intercept):
    return (y - intercept) / slope
def predict_value_x():
    """Using the line of regression a value can be predicted based on a given value.
    i.e. Predict the speed of a car (y) given it is (x) years old"""
    from scipy import stats
    age_x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6]  # population
    speed_y = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]  # population
    slope, intercept, r, p, std_err = stats.linregress(age_x, speed_y)  # get stats values
    predict_value = int(input("Given (x) predict (y): \nx = "))  # age of car (x)
    predicted = place_y(predict_value, slope, intercept)  # the speed of car given x
    print(predicted)
predict_value_x()
def predict_value_y():
    """Using the line of regression a value can be predicted based on a given value.
    i.e. Predict the age of a car (x) given its speed (y)"""
    from scipy import stats
    age_x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6]  # population
    speed_y = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]  # population
    slope, intercept, r, p, std_err = stats.linregress(age_x, speed_y)  # get stats values
    predict_value = int(input("Given (y) predict (x): \ny = "))  # speed of car (y)
    predicted = place_x(predict_value, slope, intercept)  # the age of car given y
    print(predicted)
predict_value_y()
The issue is with the place_y function, which is intended to predict y based on x, but you are using it to predict x based on y. The current implementation calculates y = slope * x + intercept, which doesn't return the correct result when trying to predict x from y. To predict x from y, you need to solve the equation y = slope * x + intercept for x: x = (y - intercept) / slope. Update the predict_value_y function in the line you calculate predicted:
predicted = (predict_value - intercept) / slope
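A quick way to check the inversion is a round trip: predicting y from x and then x from that y should return the original x. The sketch below assumes rounded, hypothetical coefficients that are only roughly those of the fit in the question, and it adds a guard for a zero slope, which the original answers do not discuss.

```python
def place_y(x, slope, intercept):
    return slope * x + intercept

def place_x(y, slope, intercept):
    # A horizontal line has no unique x for a given y.
    if slope == 0:
        raise ValueError("slope is zero; the line cannot be inverted")
    return (y - intercept) / slope

# Hypothetical, rounded coefficients (assumption, not the exact fit).
slope, intercept = -1.75, 103.1

y_at_10 = place_y(10, slope, intercept)
x_back = place_x(y_at_10, slope, intercept)
print(y_at_10, x_back)

# The unexpected negative value in the question comes from feeding y into
# the forward formula instead of the inverse one.
wrong = place_y(85, slope, intercept)
right = place_x(85, slope, intercept)
print(wrong, right)
```

With these numbers the forward formula applied to y = 85 gives a value near -45.65, which is close to the -45.75 reported in the question, while the inverse gives an age a little above 10.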

Plotting a histogram from a database using matplot and python

I'm trying to plot a histogram from a database using the matplotlib library in Python,
as shown here:
import sqlite3
import pandas as pd
cnx = sqlite3.connect('practice.db')
sql = pd.read_sql_query('''
    SELECT CAST((deliverydistance/1) as int)*1 as bin, count(*)
    FROM orders
    group by 1
    order by 1;
''', cnx)
which outputs: [image: table of the query result]
From the SQL result, I extract the columns with a for loop and place them in lists.
distance = []
counts = []
for x, y in sql.iterrows():
    y = y["count(*)"]
    counts.append(y)
    distance.append(x)
print(distance)
print(counts)
OUTPUT:
distance = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]
counts = [57136, 4711, 6569, 7268, 6755, 5757, 7643, 6175, 7954, 9418, 4945, 4178, 2844, 2104, 1829, 9, 4, 1, 3]
When I plot a histogram
plt.hist(counts,bins=distance)
I get this output: [image: resulting histogram]
My question is, how do I make it so that the count is on the Y axis and the distance is on the X axis? It doesn't seem to allow me to put it there.
You could also skip the for loop and plot directly from your pandas DataFrame using
sql.bin.plot(kind='hist', weights=sql['count(*)'])
or, with the for loop:
import matplotlib.pyplot as plt
import pandas as pd
distance = []
counts = []
for x, y in sql.iterrows():
    y = y["count(*)"]
    counts.append(y)
    distance.append(x)
plt.hist(distance, bins=distance, weights=counts)
You can skip the middle section where you count the instances of each distance. Check out this example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'distance':np.round(20 * np.random.random(100))})
df['distance'].hist(bins = np.arange(0,21,1))
Pandas has a built-in histogram plot which counts, then plots, the occurrences of each distance. You can specify the bins (in this case 0-20 with a width of 1).
If you want a horizontal histogram rather than a bar chart, pass orientation='horizontal':
distance = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]
# plt.style.use('dark_background')
counts = [57136, 4711, 6569, 7268, 6755, 5757, 7643, 6175, 7954, 9418, 4945, 4178, 2844, 2104, 1829, 9, 4, 1, 3]
plt.hist(counts,bins=distance, orientation='horizontal')
Since the counts are already computed, use a bar chart:
plt.bar(distance,counts)
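To see what the weights-based answers compute without opening a plot window, the sketch below uses np.histogram, which is the function plt.hist relies on for binning. It also illustrates one detail that is easy to miss: passing the 19 distance values themselves as bin edges yields only 18 bins, and because the last bin is closed on both sides, the counts for distances 17 and 18 are merged. Using one more edge than there are distances (an assumption about the intended binning) keeps one bin per distance.

```python
import numpy as np

distance = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]
counts = [57136, 4711, 6569, 7268, 6755, 5757, 7643, 6175, 7954, 9418,
          4945, 4178, 2844, 2104, 1829, 9, 4, 1, 3]

# Edges 0, 1, ..., 19 give 19 bins, one per distance value.
edges = np.arange(len(distance) + 1)
heights, _ = np.histogram(distance, bins=edges, weights=counts)
print(heights.tolist() == counts)  # True

# Edges 0, 1, ..., 18 give 18 bins; the last one, [17, 18], holds both
# distance 17 and distance 18.
merged, _ = np.histogram(distance, bins=distance, weights=counts)
print(len(merged), int(merged[-1]))  # 18 4
```

The same edges array can be passed as bins= to plt.hist together with weights=counts, or the binning can be skipped entirely with plt.bar.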

Piecewise Fit not working - large dataset

I have been using a solution found in several places on Stack Overflow for fitting a piecewise function:
from scipy import optimize
import matplotlib.pyplot as plt
import numpy as np
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ,11, 12, 13, 14, 15], dtype=float)
y = np.array([5, 7, 9, 11, 13, 15, 28.92, 42.81, 56.7, 70.59, 84.47, 98.36, 112.25, 126.14, 140.03])
def piecewise_linear(x, x0, y0, k1, k2):
    return np.piecewise(x, [x < x0], [lambda x:k1*x + y0-k1*x0, lambda x:k2*x + y0-k2*x0])
p, e = optimize.curve_fit(piecewise_linear, x, y)
xd = np.linspace(-5, 30, 100)
plt.plot(x, y, ".")
plt.plot(xd, piecewise_linear(xd, *p))
plt.show()
(for example, here: How to apply piecewise linear fit in Python?)
The first time I try it in the console I get an OptimizeWarning.
OptimizeWarning: Covariance of the parameters could not be estimated
category=OptimizeWarning)
After that I just get a straight line for my fit. It seems as though there is clearly a bend in the data that the fit isn't following, although I cannot figure out why.
For the dataset I am using there are about 3200 points in each x and y, is this part of the problem?
Here are some fake data that kind of simulate mine (same problem occurs where fit is not piecewise):
x = np.append(np.random.uniform(low=10.0, high=40.2, size=(1500,)), np.random.uniform(low=-10.0, high=20.2, size=(1500,)))
y = np.append(np.random.uniform(low=-3000, high=0, size=(1500,)), np.random.uniform(low=-2000, high=1000, size=(1500,)))
Just to complete the question with the answer provided in the comment above:
The issue was not due to the large number of points, but the fact that I had such large values on my y axis. Since the default initial values are 1, my values of around 1000 were too large. To fix that an initial guess for the line fit was used for parameter p0. From the docs for scipy.optimize.curve_fit it looks like:
p0 : None, scalar, or N-length sequence, optional
Initial guess for the parameters. If None, then the initial values will all be 1 (if the number of parameters for the function can be determined using introspection, otherwise a ValueError is raised).
So my final code ended up looking like this:
from scipy import optimize
import matplotlib.pyplot as plt
import numpy as np
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ,11, 12, 13, 14, 15], dtype=float)
y = np.array([500, 700, 900, 1100, 1300, 1500, 2892, 4281, 5670, 7059, 8447, 9836, 11225, 12614, 14003])
def piecewise_linear(x, x0, y0, k1, k2):
    return np.piecewise(x, [x < x0], [lambda x:k1*x + y0-k1*x0, lambda x:k2*x + y0-k2*x0])
p, e = optimize.curve_fit(piecewise_linear, x, y, p0=(10, -2500, 0, -500))
xd = np.linspace(-5, 30, 100)
plt.plot(x, y, ".")
plt.plot(xd, piecewise_linear(xd, *p))
plt.show()
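If choosing p0 by hand is inconvenient, one option is to derive a rough starting point from the data itself. The helper below is a hypothetical heuristic, not part of the original answer and not a scipy function: it takes the middle point of the sorted data as the breakpoint guess and fits a straight line to each half with np.polyfit to get the two slope guesses.

```python
import numpy as np

def rough_initial_guess(x, y):
    """Hypothetical helper: crude starting values (x0, y0, k1, k2)."""
    order = np.argsort(x)
    xs = np.asarray(x, dtype=float)[order]
    ys = np.asarray(y, dtype=float)[order]
    mid = len(xs) // 2
    # Slope of a straight-line fit to each half of the sorted data.
    k1 = np.polyfit(xs[:mid], ys[:mid], 1)[0]
    k2 = np.polyfit(xs[mid:], ys[mid:], 1)[0]
    return xs[mid], ys[mid], k1, k2

x = np.arange(1, 16, dtype=float)
y = np.array([500, 700, 900, 1100, 1300, 1500, 2892, 4281, 5670, 7059,
              8447, 9836, 11225, 12614, 14003], dtype=float)

p0 = rough_initial_guess(x, y)
print(p0)
```

The resulting tuple has the same order as the parameters of piecewise_linear, so it could be passed as p0=p0 to optimize.curve_fit. The guess only needs to be on the right scale; the optimizer refines it from there.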
Just for fun (very scattered case):
Since the original data was not available, the coordinates of the points were obtained from the figure published in Rachel W's question, by scanning the image and recording the blue pixels. There are some artifacts caused by the straight line and the grid, which appear in white after scanning.
The result of a piecewise regression (two segments) is drawn in red on the above figure.
The equation of the fitted function is: [equation image not preserved]
The regression method used is not iterative and does not require an initial guess. The code is very simple: see pp. 12-13 of this paper https://fr.scribd.com/document/380941024/Regression-par-morceaux-Piecewise-Regression-pdf

Piecewise regression python

I am trying to do a piecewise linear regression in Python, and the data looks like this: [image: scatter plot of the data]
I need to fit three lines, one for each section. Any idea how? I have the following code, and the result is shown below. Any help would be appreciated.
import numpy as np
import matplotlib
import matplotlib.cm as cm
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
from scipy import optimize
def piecewise(x,x0,x1,y0,y1,k0,k1,k2):
    return np.piecewise(x , [x <= x0, np.logical_and(x0<x, x< x1),x>x1] , [lambda x:k0*x + y0, lambda x:k1*(x-x0)+y1+k0*x0, lambda x:k2*(x-x1)+y0+y1+k0*x0+k1*(x1-x0)])
x1 = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ,11, 12, 13, 14, 15,16,17,18,19,20,21], dtype=float)
y1 = np.array([5, 7, 9, 11, 13, 15, 28.92, 42.81, 56.7, 70.59, 84.47, 98.36, 112.25, 126.14, 140.03,145,147,149,151,153,155])
y1 = np.flip(y1,0)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ,11, 12, 13, 14, 15,16,17,18,19,20,21], dtype=float)
y = np.array([5, 7, 9, 11, 13, 15, 28.92, 42.81, 56.7, 70.59, 84.47, 98.36, 112.25, 126.14, 140.03,145,147,149,151,153,155])
y = np.flip(y,0)
perr_min = np.inf
p_best = None
for n in range(100):
    k = np.random.rand(7)*20
    p , e = optimize.curve_fit(piecewise, x1, y1,p0=k)
    perr = np.sum(np.abs(y1-piecewise(x1, *p)))
    if(perr < perr_min):
        perr_min = perr
        p_best = p
xd = np.linspace(0, 21, 100)
plt.figure()
plt.plot(x1, y1, "o")
y_out = piecewise(xd, *p_best)
plt.plot(xd, y_out)
plt.show()
[image: data with fit]
Thanks.
A very simple method (no iteration, no initial guess) can solve this problem.
The calculation method comes from page 30 of this paper: https://fr.scribd.com/document/380941024/Regression-par-morceaux-Piecewise-Regression-pdf (copy below).
The next figure shows the result:
The equation of the fitted function is:
Or equivalently:
H is the Heaviside function.
In addition, the details of the numerical calculation are given below:
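The fitted equations from this answer were images and are not reproduced here. As a separate illustration, and not the method from the cited paper, a three-segment line can be written so that it is continuous by construction, using hinge terms: (x - x0) * H(x - x0) is the same as max(x - x0, 0). The parameter names below (a value y0 at the first breakpoint and three slopes) are assumptions made for this sketch.

```python
import numpy as np

def piecewise3(x, x0, x1, y0, k0, k1, k2):
    """Continuous three-segment line.

    Passes through (x0, y0); slope k0 left of x0, k1 between x0 and x1,
    and k2 right of x1. Each hinge term changes the slope at a breakpoint.
    """
    x = np.asarray(x, dtype=float)
    return (y0 + k0 * (x - x0)
            + (k1 - k0) * np.maximum(x - x0, 0.0)
            + (k2 - k1) * np.maximum(x - x1, 0.0))

values = piecewise3([0, 5, 10, 15, 20], x0=5, x1=15, y0=10, k0=1, k2=3, k1=2)
print(values.tolist())  # [5.0, 10.0, 20.0, 30.0, 45.0]
```

This form needs six parameters instead of the seven in the question's function, because the segments are forced to meet at the breakpoints rather than each carrying its own offset. A function with this signature could be handed to optimize.curve_fit in the same way as the question's piecewise function.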

Python: Finding a trend in a set of numbers

I have a list of numbers in Python, like this:
x = [12, 34, 29, 38, 34, 51, 29, 34, 47, 34, 55, 94, 68, 81]
What's the best way to find the trend in these numbers? I'm not interested in predicting what the next number will be, I just want to output the trend for many sets of numbers so that I can compare the trends.
Edit: By trend, I mean that I'd like a numerical representation of whether the numbers are increasing or decreasing and at what rate. I'm not massively mathematical, so there's probably a proper name for this!
Edit 2: It looks like what I really want is the co-efficient of the linear best fit. What's the best way to get this in Python?
Possibly you mean you want to plot these numbers on a graph and find a straight line through them where the overall distance between the line and the numbers is minimized? This is called a linear regression.
def linreg(X, Y):
    """
    return a,b in solution to y = ax + b such that root mean square distance between trend line and original points is minimized
    """
    N = len(X)
    Sx = Sy = Sxx = Syy = Sxy = 0.0
    for x, y in zip(X, Y):
        Sx = Sx + x
        Sy = Sy + y
        Sxx = Sxx + x*x
        Syy = Syy + y*y
        Sxy = Sxy + x*y
    det = Sxx * N - Sx * Sx
    return (Sxy * N - Sy * Sx)/det, (Sxx * Sy - Sx * Sxy)/det
x = [12, 34, 29, 38, 34, 51, 29, 34, 47, 34, 55, 94, 68, 81]
a,b = linreg(range(len(x)),x)  # your x,y are switched from standard notation
The trend line is unlikely to pass through your original points, but it will be as close as possible to the original points that a straight line can get. Using the gradient and intercept values of this trend line (a,b) you will be able to extrapolate the line past the end of the array:
extrapolatedtrendline=[a*index + b for index in range(20)]  # replace 20 with desired trend length
The link provided by Keith and the answer from Riaz may help you work out the fit yourself, but it is generally better to use a library when one is available. For this problem, numpy provides a polynomial fitting function called polyfit, which can fit the data with a polynomial of any degree.
Here is an example using numpy to fit the data with a linear equation of the form y=ax+b
>>> import numpy as np
>>> data = [12, 34, 29, 38, 34, 51, 29, 34, 47, 34, 55, 94, 68, 81]
>>> x = np.arange(0,len(data))
>>> y=np.array(data)
>>> z = np.polyfit(x,y,1)
>>> print("{0:.12g}x + {1:.12g}".format(*z))
4.32527472527x + 17.6
>>>
Similarly, a quadratic fit would be
>>> z = np.polyfit(x,y,2)
>>> print("{0:.12g}x^2 + {1:.12g}x + {2:.12g}".format(*z))
0.311126373626x^2 + 0.280631868132x + 25.6892857143
>>>
Here is one way to get an increasing/decreasing trend:
>>> x = [12, 34, 29, 38, 34, 51, 29, 34, 47, 34, 55, 94, 68, 81]
>>> trend = [b - a for a, b in zip(x[::1], x[1::1])]
>>> trend
[22, -5, 9, -4, 17, -22, 5, 13, -13, 21, 39, -26, 13]
In the resulting list trend, trend[0] can be interpreted as the increase from x[0] to x[1], trend[1] would be the increase from x[1] to x[2] etc. Negative values in trend mean that value in x decreased from one index to the next.
You could do a least squares fit of the data.
Using the formula from this page:
y = [12, 34, 29, 38, 34, 51, 29, 34, 47, 34, 55, 94, 68, 81]
N = len(y)
x = range(N)
B = (sum(x[i] * y[i] for i in range(N)) - 1./N*sum(x)*sum(y)) / (sum(x[i]**2 for i in range(N)) - 1./N*sum(x)**2)
A = 1.*sum(y)/N - B * 1.*sum(x)/N
print("%f + %f * x" % (A, B))
This prints the starting value (intercept) and delta (slope) of the best-fit line.
I agree with Keith, I think you're probably looking for a linear least squares fit (if all you want to know is if the numbers are generally increasing or decreasing, and at what rate). The slope of the fit will tell you at what rate they're increasing. If you want a visual representation of a linear least squares fit, try Wolfram Alpha:
http://www.wolframalpha.com/input/?i=linear+fit+%5B12%2C+34%2C+29%2C+38%2C+34%2C+51%2C+29%2C+34%2C+47%2C+34%2C+55%2C+94%2C+68%2C+81%5D
Update: If you want to implement a linear regression in Python, I recommend starting with the explanation at Mathworld:
http://mathworld.wolfram.com/LeastSquaresFitting.html
It's a very straightforward explanation of the algorithm, and it practically writes itself. In particular, you want to pay close attention to equations 16-21, 27, and 28.
Try writing the algorithm yourself, and if you have problems, you should open another question.
You can find the OLS coefficients using numpy:
import numpy as np
y = [12, 34, 29, 38, 34, 51, 29, 34, 47, 34, 55, 94, 68, 81]
x = []
x.append(list(range(len(y))))  # Time variable
x.append([1 for ele in range(len(y))])  # This adds the intercept
y = np.matrix(y).T
x = np.matrix(x).T
betas = ((x.T*x).I*x.T*y)
Results:
>>> betas
matrix([[ 4.32527473], #coefficient on the time variable
[ 17.6 ]]) #coefficient on the intercept
Since the coefficient on the trend variable is positive, observations in your variable are increasing over time.
You can simply use the scipy library:
import numpy as np
from scipy.stats import linregress
data = [12, 34, 29, 38, 34, 51, 29, 34, 47, 34, 55, 94, 68, 81]
x = np.arange(1,len(data)+1)
y=np.array(data)
res = linregress(x, y)
print(f'Equation: {res.slope:.3f} * t + {res.intercept:.3f}, R^2: {res.rvalue ** 2:.2f} ')
res
Output:
Equation: 4.325 * t + 13.275, R^2: 0.66
LinregressResult(slope=4.325274725274725, intercept=13.274725274725277, rvalue=0.8096297800892154, pvalue=0.0004497809466484867, stderr=0.9051717124425395, intercept_stderr=7.707259409345618)
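The answers above report the same slope but two different intercepts, 17.6 and 13.275. That difference comes only from the choice of x values: some answers number the points 0 to 13 and others 1 to 14. The sketch below, written in plain Python with a hypothetical helper name fit_line, shows that shifting x by one leaves the slope unchanged and lowers the intercept by exactly one slope.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept, with slope = cov(x, y) / var(x)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean

data = [12, 34, 29, 38, 34, 51, 29, 34, 47, 34, 55, 94, 68, 81]

slope0, intercept0 = fit_line(range(len(data)), data)         # x = 0..13
slope1, intercept1 = fit_line(range(1, len(data) + 1), data)  # x = 1..14

print(round(slope0, 3), round(intercept0, 3))  # 4.325 17.6
print(round(slope1, 3), round(intercept1, 3))  # 4.325 13.275
```

For comparing trends across several lists, the slope is therefore the number to use; the intercept depends on where the index starts.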
Compute the beta coefficient (the slope) as cov(x, y) / var(x).
y = [12, 34, 29, 38, 34, 51, 29, 34, 47, 34, 55, 94, 68, 81]
x = range(1,len(y)+1)
def var(X):
    S = 0.0
    SS = 0.0
    for x in X:
        S += x
        SS += x*x
    xbar = S/float(len(X))
    return (SS - len(X) * xbar * xbar) / (len(X) -1.0)
def cov(X,Y):
    n = len(X)
    xbar = sum(X) / n  # true division is required here
    ybar = sum(Y) / n
    return sum([(x-xbar)*(y-ybar) for x,y in zip(X,Y)])/(n-1)
def beta(x,y):
    return cov(x,y)/var(x)
print(beta(x,y))  # about 4.3253, matching the slope from the other answers
