Histogram output not matching input - python

Using matplotlib, I'm trying to generate a histogram from a list of values. My output looks like the image linked here: http://i.stack.imgur.com/4bedX.png. I know for a fact that none of the values in my list is higher than 200, yet the x-axis extends well beyond that.
I loaded these values into a list, and in the graph-making function I printed the list just to check and it has the correct values. This is what my code looks like. I have already predefined yax, and I load it into a numpy array.
myarray = np.asarray(yax)
plt.hist(myarray, bins=100, histtype='stepfilled')
plt.xlabel("Bins")
plt.ylabel("Frequency")
plt.ylim(0,10)
My list of values looks something like this (except larger):
[38, 45, 43, 36, 35, 32, 31, 32, 31, 35, 38, 35, 33, 33, 36, 36, 35, 36, 39, 41, 38, 37, 39, 39, 38, 35, 34, 35, 38, 42, 37, 37, 34, 34, 29, 30, 37, 33, 31, 32, 35, 36, 41, 46, 44, 46, 42, 38, 41, 40, 38]
Here's the actual list which I'm trying to run the program on: http://pastebin.com/U1u6SPsA

The histogram is correct. Assuming the full array you have provided is defined as "yax",
import matplotlib.pyplot as plt; import numpy as np
myarray = np.asarray(yax)
plt.hist(myarray, bins=100, histtype='stepfilled')
plt.xlabel("Bins")
plt.ylabel("Frequency")
plt.ylim(0,10)
plt.show() # produces the exact histogram you provided at http://i.stack.imgur.com/4bedX.png
len([x for x in yax if x >= 200])
>>> 102
len([x for x in yax if x >= 1000])
>>> 37
len([x for x in yax if x >= 6000])
>>> 13
len([x for x in yax if x >= 7000])
>>> 3 # matches the height of the histogram for the bin # 7000
The lesson is that visual inspection is not enough -- don't make assumptions about the composition of your dataset.
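A quick numeric check before plotting would have exposed the outliers. The sketch below is only an illustration: the values list is made up (it is not the pastebin data) and summarize is a hypothetical helper name.

```python
def summarize(values, threshold):
    """Return (minimum, maximum, number of values >= threshold)."""
    above = sum(1 for v in values if v >= threshold)
    return min(values), max(values), above


# Made-up sample: mostly small values plus a few large outliers.
values = [38, 45, 43, 36, 35, 32, 31, 250, 1200, 7100]

lo, hi, n_above = summarize(values, threshold=200)
print(f"min={lo}, max={hi}, values >= 200: {n_above}")

# If only the 0-200 region is of interest, matplotlib's hist accepts a
# `range` argument, e.g. plt.hist(values, bins=100, range=(0, 200)).
```

By default the histogram bins span the minimum to the maximum of the data, so a handful of large values is enough to stretch the x-axis.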

Related

scipy.stats.binned_statistic_dd() bin numbering has lots of extra bins

I'm struggling to deal with a scipy.stats.binned_statistic_dd() result. I have an array of positions and another array of ids that I'm binning in 3 directions. I'm providing a list of the bin edges as input rather than a number of bins in each direction coupled with a range option. I have 3 bins in x, 2 in y, and 3 in z, or 18 bins.
However, when I check the binnumbers listed, they are all in a range greater than 20. How do I get the bin numbers to reflect the number of bins provided and get rid of all the extra bins?
I've tried to follow what was suggested in this post (Output in scipy.stats.binned_statistic_dd()) which deals with something similar, but I can't understand how to apply this to my case. As usual, the documentation is as cryptic as ever.
Any help getting my binnumbers between 1 and 18 in this example would be greatly appreciated!
import numpy as np
from scipy import stats
pos = np.array([[-0.02042167, -0.0223282 , 0.00123734],
[-0.0420364 , 0.01196078, 0.00694259],
[-0.09625651, -0.00311446, 0.06125461],
[-0.07693234, -0.02749618, 0.03617278],
[-0.07578646, 0.01199925, 0.02991888],
[-0.03258293, -0.00371765, 0.04245596],
[-0.06765955, 0.02798434, 0.07075846],
[-0.02431445, 0.02774102, 0.06719837],
[ 0.02798265, -0.01096739, -0.01658691],
[-0.00584252, 0.02043389, -0.00827088],
[ 0.00623063, -0.02642285, 0.03232817],
[ 0.00884222, 0.01498996, 0.02912483],
[ 0.07189474, -0.01541584, 0.01916607],
[ 0.07239394, 0.0059483 , 0.0740187 ],
[-0.08519159, -0.02894125, 0.10923724],
[-0.10803509, 0.01365444, 0.09555333],
[-0.0442866 , -0.00845725, 0.10361843],
[-0.04246779, 0.00396127, 0.1418258 ],
[-0.08975861, 0.02999023, 0.12713186],
[ 0.01772454, -0.0020405 , 0.08824418]])
ids = np.array([16, 9, 6, 19, 1, 4, 10, 5, 18, 11, 2, 12, 13, 8, 3, 17, 14,
15, 20, 7])
xbinEdges = np.array([-0.15298488, -0.05108961, 0.05080566, 0.15270093])
ybinEdges = np.array([-0.051, 0. , 0.051])
zbinEdges = np.array([-0.053, 0.049, 0.151, 0.253])
ret = stats.binned_statistic_dd(pos, ids, bins=[xbinEdges, ybinEdges, zbinEdges],
statistic='count', expand_binnumbers=False)
bincounts = ret.statistic
binnumber = ret.binnumber.T
>>> binnumber = array([46, 51, 27, 26, 31, 46, 32, 52, 46, 51, 46, 51, 66, 72, 27, 32, 47,
52, 32, 47], dtype=int64)
ranges = [[-0.15298488071, 0.15270092971],
[-0.051000000000000004, 0.051000000000000004],
[-0.0530000000000001, 0.25300000000000006]]
ret3 = stats.binned_statistic_dd(pos, ids, bins=(3,2,3), statistic='count', expand_binnumbers=False, range=ranges)
bincounts = ret3.statistic
binnumber = ret3.binnumber.T
>>> binnumber = array([46, 51, 27, 26, 31, 46, 32, 52, 46, 51, 46, 51, 66, 72, 27, 32, 47,
52, 32, 47], dtype=int64)
OK, after several days of background thinking and a quick scour through the binned_statistic_dd() source code, I think I've found the answer, and it's pretty simple.
It seems binned_statistic_dd() adds an extra set of outlier bins in each dimension during the binning phase and then removes them when returning the histogram results, while leaving the bin numbers untouched (I think this is in case you want to reuse the result for further stats outputs).
So if you export the expanded bin numbers (expand_binnumbers=True) and subtract 1 from each to re-adjust the bin indices, you can calculate the "correct" bin ids.
numX, numY, numZ = 3, 2, 3  # number of bins in x, y and z
ret2 = stats.binned_statistic_dd(pos, ids, bins=[xbinEdges, ybinEdges, zbinEdges],
                                 statistic='count', expand_binnumbers=True)
bincounts2 = ret2.statistic
binnumber2 = ret2.binnumber
indxnum2 = binnumber2 - 1  # shift the 1-based per-dimension indices to 0-based
corrected_bin_ids = np.ravel_multi_index(tuple(indxnum2), (numX, numY, numZ))
Quick and simple in the end!
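The relationship between the flat and the corrected bin numbers can be checked without re-running the binning. This is a minimal sketch under one assumption: that each flat bin number is a C-order index into a grid with one extra outlier bin on each side of every dimension, i.e. a grid of shape (3+2, 2+2, 3+2). The names inner_shape and padded_shape are illustrative only.

```python
import numpy as np

# Flat bin numbers reported in the question (expand_binnumbers=False).
flat = np.array([46, 51, 27, 26, 31, 46, 32, 52, 46, 51,
                 46, 51, 66, 72, 27, 32, 47, 52, 32, 47])

inner_shape = (3, 2, 3)                            # bins requested in x, y, z
padded_shape = tuple(n + 2 for n in inner_shape)   # plus one outlier bin per side

# Split each flat number into per-dimension indices (1-based for inner bins).
per_dim = np.array(np.unravel_index(flat, padded_shape))

# Drop the outlier offset and re-flatten over the 3 x 2 x 3 grid.
corrected = np.ravel_multi_index(tuple(per_dim - 1), inner_shape)
print(corrected.min(), corrected.max())
```

The corrected ids are 0-based (0 to 17); add 1 if ids between 1 and 18 are wanted.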

Find proportional sampling using python

I'm given a problem that explicitly asks me not to use numpy or pandas.
Problem: select an element from the list A randomly, with probability proportional to its magnitude. Assume we run the same experiment 100 times with replacement; in each experiment, print a number selected randomly from A.
Ex 1: A = [0 5 27 6 13 28 100 45 10 79]
Let f(x) denote the number of times x is selected in 100 experiments.
f(100) > f(79) > f(45) > f(28) > f(27) > f(13) > f(10) > f(6) > f(5) > f(0)
First, I took the sum of all the elements of list A.
I then divided each element of list A by the sum (to normalize) and stored each of these values in another list (d_dash).
I then created another empty list (d_bar) that holds the cumulative sum of the elements of d_dash.
Finally, I created a variable r = random.uniform(0.0, 1.0) and, for each index k, compared r to d_bar[k]; if r <= d_bar[k], return A[k].
However, I'm getting the error "list index out of range" near d_dash[j].append((A[j]/sum)). I'm not sure what the issue is, as I did not exceed the index of either d_dash or A.
Also, is my logic correct? Suggestions for a better way to do this would be appreciated.
Thanks in advance.
import random
A = [0,5,27,6,13,28,100,45,10,79]
def propotional_sampling(A):
    sum=0
    for i in range(len(A)):
        sum = sum + A[i]
    d_dash=[]
    for j in range(len(A)):
        d_dash[j].append((A[j]/sum))
    #cumulative sum
    d_bar =[]
    d_bar[0]= 0
    for k in range(len(A)):
        d_bar[k] = d_bar[k] + d_dash[k]
    r = random.uniform(0.0,1.0)
    number=0
    for p in range(len(d_bar)):
        if(r<=d_bar[p]):
            number=d_bar[p]
    return number
def sampling_based_on_magnitued():
    for i in range(1,100):
        number = propotional_sampling(A)
        print(number)
sampling_based_on_magnitued()
Below is the code to do the same:
import random
A = [0, 5, 27, 6, 13, 28, 100, 45, 10, 79]
# Sum of all the elements in the list
S = sum(A)
# Normalize each element by the sum
norm_sum = [ele/S for ele in A]
# Cumulative sum of the normalized values
cum_norm_sum = []
cum_norm_sum.append(norm_sum[0])
for itr in range(1, len(norm_sum), 1):
    cum_norm_sum.append(cum_norm_sum[-1] + norm_sum[itr])
def prop_sampling(cum_norm_sum):
    """
    This function returns an element
    with proportional sampling.
    """
    r = random.random()
    for itr in range(len(cum_norm_sum)):
        if r < cum_norm_sum[itr]:
            return A[itr]
# Sampling 1000 elements from the given list with proportional sampling
sampled_elements = []
for itr in range(1000):
    sampled_elements.append(prop_sampling(cum_norm_sum))
The image below shows the frequency of each element in the sampled points:
Clearly, the number of times each element appears is proportional to its magnitude.
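The frequencies can also be checked in text form. The following is a rough sketch using only the standard library; it uses integer cumulative sums instead of the normalized sums above (a small variation, chosen to sidestep floating-point rounding at the top of the range), and sample_proportional is a hypothetical name.

```python
import random
from collections import Counter

A = [0, 5, 27, 6, 13, 28, 100, 45, 10, 79]


def sample_proportional(values, rng):
    """Pick one element with probability proportional to its magnitude."""
    total = sum(values)
    r = rng.random() * total      # r is in [0, total)
    running = 0
    for v in values:
        running += v
        if r < running:
            return v
    return values[-1]             # safety net; not expected to be reached


rng = random.Random(0)            # fixed seed so runs are repeatable
counts = Counter(sample_proportional(A, rng) for _ in range(1000))
for value, count in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(value, count)
```

The element 0 has zero weight, so it should never appear in the tally.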
The cumulative sum can be computed with itertools.accumulate. The loop:
for p in range(len(d_bar)):
    if(r<=d_bar[p]):
        number=d_bar[p]
can be replaced by bisect.bisect():
import random
from itertools import accumulate
from bisect import bisect
A = [0,5,27,6,13,28,100,45,10,79]
def propotional_sampling(A, n=100):
    # calculate cumulative sum from A:
    cum_sum = [*accumulate(A)]
    # cum_sum = [0, 5, 32, 38, 51, 79, 179, 224, 234, 313]
    out = []
    for _ in range(n):
        i = random.random() # i is in [0.0, 1.0)
        idx = bisect(cum_sum, i*cum_sum[-1]) # get index to list A
        out.append(A[idx])
    return out
print(propotional_sampling(A))
Prints (for example):
[10, 100, 100, 79, 28, 45, 45, 27, 79, 79, 79, 79, 100, 27, 100, 100, 100, 13, 45, 100, 5, 100, 45, 79, 100, 28, 79, 79, 6, 45, 27, 28, 27, 79, 100, 79, 79, 28, 100, 79, 45, 100, 10, 28, 28, 13, 79, 79, 79, 79, 28, 45, 45, 100, 28, 27, 79, 27, 45, 79, 45, 100, 28, 100, 100, 5, 100, 79, 28, 79, 13, 100, 100, 79, 28, 100, 79, 13, 27, 100, 28, 10, 27, 28, 100, 45, 79, 100, 100, 100, 28, 79, 100, 45, 28, 79, 79, 5, 45, 28]
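For comparison, the standard library's random.choices accepts a weights argument (Python 3.6+), so the whole experiment fits in one call. Whether this counts as acceptable under the "no numpy or pandas" rule is an assumption; the sketch only shows that the call exists and how it would be used here.

```python
import random

A = [0, 5, 27, 6, 13, 28, 100, 45, 10, 79]

random.seed(0)  # fixed seed only so the run is repeatable
# Use each element's own magnitude as its weight; k draws with replacement.
samples = random.choices(A, weights=A, k=100)
print(samples)
```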
The reason you got the "list index out of range" message is that you created an empty list ("d_bar = []") and then started assigning values to it by index ("d_bar[k] = d_bar[k] + d_dash[k]"); the same applies to "d_dash[j].append(...)" on the empty d_dash. I recommend using the following structure instead.
First, define it in this way:
d_bar=[0 for i in range(len(A))]
Also, I believe this code will return 1 forever, as there is no break in the loop. You can resolve this by adding "break". Here is an updated version of your code:
import random
A = [0, 5, 27, 6, 13, 28, 100, 45, 10, 79]
def pick_a_number_from_list(A):
    sum=0
    for i in A:
        sum+=i
    A_norm=[]
    for j in A:
        A_norm.append(j/sum)
    A_cum=[0 for i in range(len(A))]
    A_cum[0]=A_norm[0]
    for k in range(len(A_norm)-1):
        A_cum[k+1]=A_cum[k]+A_norm[k+1]
    r = random.uniform(0.0,1.0)
    number=0
    for p in range(len(A_cum)):
        if(r<=A_cum[p]):
            number=A[p]
            break
    return number
def sampling_based_on_magnitued():
    for i in range(100): # 100 experiments
        number = pick_a_number_from_list(A)
        print(number)
sampling_based_on_magnitued()

Finding the closest to value in two datasets using a for loop

In MATLAB, I am able to identify the values in data_b that come closest to the values in data_a, along with the indices that indicate where in the matrix they occur, with the following code:
clear all; close all; clc;
data_a = [0; 15; 30; 45; 60; 75; 90];
data_b = randi([0, 90], [180, 101]);
[rows_a,cols_a] = size(data_a);
[rows_b,cols_b] = size(data_b);
val1 = zeros(rows_a,cols_b);
ind1 = zeros(rows_a,cols_b);
for i = 1:cols_b
    for j = 1:rows_a
        [val1(j,i),ind1(j,i)] = min(abs(data_b(:,i) - data_a(j)));
    end
end
Since I would like to phase out MATLAB (I will be out of a license eventually), I decided to try the same in python, without any luck:
import numpy as np
data_a = np.array([[0],[15],[30],[45],[60],[75],[90]])
data_b = np.random.randint(91, size=(180, 101))
[rows_a,cols_a] = data_a.shape
[rows_b,cols_b] = data_b.shape
val1 = np.zeros((rows_a,cols_b))
ind1 = np.zeros((rows_a,cols_b))
for i in range(cols_b):
    for j in range(rows_a):
        [val1[j][i],ind1[j][i]] = np.amin(np.abs(data_b[:][i] - data_a[j]))
The code also produced an error that made me none the wiser:
TypeError: cannot unpack non-iterable numpy.int32 object
If anyone could find time to explain why I am an ignorant fool by indicating what I did wrong, and what I could do to fix it, I would be grateful as this has proven to become a major obstacle for my progress.
Thank you.
I think you are facing two problems:
Incorrect use of slicing for multidimensional arrays: use [i, j] instead of [i][j]
Improper translation of min() from MATLAB to NumPy: you have to use both argmin() and min().
Your fixed code would look like:
import numpy as np
# just to make it reproducible in testing, can be commented for production
np.random.seed(0)
data_a = np.array([[0],[15],[30],[45],[60],[75],[90]])
data_b = np.random.randint(91, size=(180, 101))
[rows_a,cols_a] = data_a.shape
[rows_b,cols_b] = data_b.shape
val1 = np.zeros((rows_a,cols_b), dtype=int)
ind1 = np.zeros((rows_a,cols_b), dtype=int)
for i in range(cols_b):
    for j in range(rows_a):
        diff = np.abs(data_b[:, i] - data_a[j])
        ind1[j, i] = np.argmin(diff)
        val1[j, i] = diff[ind1[j, i]]
However, I would avoid direct looping here and I would make good use of broadcasting:
import numpy as np
# just to make it reproducible in testing, can be commented for production
np.random.seed(0)
data_a = np.arange(0, 90 + 1, 15).reshape((-1, 1, 1))
data_b = np.random.randint(90 + 1, size=(1, 180, 101))
tmp_arr = np.abs(data_a.reshape(-1, 1, 1) - data_b.reshape(1, 180, -1), dtype=int)
min_idxs = np.argmin(tmp_arr, axis=1)
min_vals = np.min(tmp_arr, axis=1)
del tmp_arr # you can delete this if you no longer need it
where now ind1 == min_idxs and val1 == min_vals, i.e.:
print(np.all(min_idxs == ind1))
# True
print(np.all(min_vals == val1))
# True
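Both problems named in this answer can be seen on a small array. This is an illustrative sketch only: the 3x4 array is made up, and min_with_index is a hypothetical helper that mimics MATLAB's two-output min.

```python
import numpy as np

b = np.arange(12).reshape(3, 4)   # small made-up array: 3 rows, 4 columns

# b[:] is just a view of b, so b[:][1] selects ROW 1, not column 1.
row = b[:][1]
col = b[:, 1]
print(row)   # row 1 of b
print(col)   # column 1 of b


def min_with_index(values):
    """Return (minimum, index of first minimum), like MATLAB's [m, i] = min(v)."""
    idx = int(np.argmin(values))
    return values[idx], idx       # note: the index is 0-based, unlike MATLAB


val, idx = min_with_index(np.abs(col - 6))
print(val, idx)
```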
Your error comes from "[val1[j][i],ind1[j][i]] = (a single number)". You are trying to unpack a single value into two targets, which doesn't work in Python. What about this?
import numpy as np
data_a = np.array([[0],[15],[30],[45],[60],[75],[90]])
data_b = np.random.randint(91, size=(180,101))
[rows_a,cols_a] = data_a.shape
[rows_b,cols_b] = data_b.shape
val1 = np.zeros((rows_a,cols_b))
ind1 = np.zeros((rows_a,cols_b))
for i in range(cols_b):
    for j in range(rows_a):
        array = np.abs(data_b[:][i] - data_a[j])
        val = np.amin(array)
        val1[j][i] = val
        ind1[j][i] = np.where(val == array)[0][0]
NumPy's amin does not return an index, so you need to find it using np.where. This example does not store the full index, only the index of the first occurrence in the row. Then you can pull it out, since your row order matches your column order in ind1 and data_b. For instance, on the first iteration:
In [2]: np.abs(data_b[:][0] - data_a[j0])
Out[2]:
array([ 3, 31, 19, 53, 28, 81, 10, 11, 89, 15, 50, 22, 40, 81, 43, 29, 63,
72, 22, 37, 54, 12, 19, 78, 85, 78, 37, 81, 41, 24, 29, 56, 37, 86,
67, 7, 38, 27, 83, 81, 66, 32, 68, 29, 71, 26, 12, 27, 45, 58, 17,
57, 54, 55, 23, 21, 46, 58, 75, 10, 25, 85, 70, 76, 0, 11, 19, 83,
81, 68, 8, 63, 72, 48, 18, 29, 0, 47, 85, 79, 72, 85, 28, 28, 7,
41, 80, 56, 59, 44, 82, 33, 42, 23, 42, 89, 58, 52, 44, 65, 65])
In [3]: np.amin(array)
Out[3]: 0
In [4]: val
Out[4]: 0
In [5]: np.where(val == array)[0][0]
Out[5]: 69
In [6]: data_b[0,69]
Out[6]: 0

Blockproc like function for Python image processing

edit: it's an image so the suggested (How can I efficiently process a numpy array in blocks similar to Matlab's blkproc (blockproc) function) isn't really working for me
I have the following matlab code
fun = @(block_struct) ...
    std2(block_struct.data) * ones(size(block_struct.data));
B=blockproc(im2double(Icorrected), [4 4], fun);
I want to rewrite my code in Python. I have installed Scikit and I'm trying to work around it like this:
b = np.std(a, axis = 2)
The problem, of course, is that I'm not applying the std to individual blocks, as the MATLAB code above does.
How can I do something like this? Start a loop and call the function for each X*X block? Then I wouldn't keep the original size.
Is there another more efficient way?
If there is no overlap in the windows you can reshape the data to suit your needs:
Find the mean of 3x3 windows of a 9x9 array.
>>> import numpy as np
>>> a = np.arange(81).reshape(9, 9)
>>> a
array([[ 0, 1, 2, 3, 4, 5, 6, 7, 8],
[ 9, 10, 11, 12, 13, 14, 15, 16, 17],
[18, 19, 20, 21, 22, 23, 24, 25, 26],
[27, 28, 29, 30, 31, 32, 33, 34, 35],
[36, 37, 38, 39, 40, 41, 42, 43, 44],
[45, 46, 47, 48, 49, 50, 51, 52, 53],
[54, 55, 56, 57, 58, 59, 60, 61, 62],
[63, 64, 65, 66, 67, 68, 69, 70, 71],
[72, 73, 74, 75, 76, 77, 78, 79, 80]])
Find the new shape
>>> window_size = (3,3)
>>> tuple(s // w for s, w in zip(a.shape, window_size)) + window_size
(3, 3, 3, 3)
>>> b = a.reshape(3,3,3,3)
Find the mean over axes 1 and 3 (the axes that run inside each window).
>>> b.mean(axis = (1,3))
array([[ 10., 13., 16.],
[ 37., 40., 43.],
[ 64., 67., 70.]])
>>>
2x2 windows of a 4x4 array:
>>> a = np.arange(16).reshape((4,4))
>>> window_size = (2,2)
>>> tuple(s // w for s, w in zip(a.shape, window_size)) + window_size
(2, 2, 2, 2)
>>> b = a.reshape(2,2,2,2)
>>> b.mean(axis = (1,3))
array([[ 2.5, 4.5],
[ 10.5, 12.5]])
>>>
It won't work if the window size doesn't divide the array size evenly. In that case, or if you want overlapping windows, numpy.lib.stride_tricks.as_strided is the way to go; a generic N-D function can be found at Efficient Overlapping Windows with Numpy.
Another option for 2d arrays is sklearn.feature_extraction.image.extract_patches_2d, and for n-dimensional arrays sklearn.feature_extraction.image.extract_patches. Both manipulate the array's strides to produce the patches/windows.
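Applied to the question, the reshape idea gives a per-block standard deviation that can be expanded back to the original image size. This is a sketch under stated assumptions: a 2-D grayscale array whose height and width are multiples of the block size, a hypothetical function name block_std, and a ddof parameter left to the caller (which convention matches MATLAB's std2 should be checked rather than taken from here).

```python
import numpy as np


def block_std(image, block, ddof=0):
    """Replace every non-overlapping block by its standard deviation."""
    h, w = image.shape
    m, n = block
    if h % m or w % n:
        raise ValueError("image size must be a multiple of the block size")
    # Axes: (block row, row inside block, block column, column inside block)
    blocks = image.reshape(h // m, m, w // n, n)
    stds = blocks.std(axis=(1, 3), ddof=ddof)          # one value per block
    # Repeat each value m times vertically and n times horizontally.
    return np.repeat(np.repeat(stds, m, axis=0), n, axis=1)


img = np.arange(16, dtype=float).reshape(4, 4)   # made-up 4x4 "image"
out = block_std(img, (2, 2))
print(out.shape)
```

The output has the same shape as the input, which mirrors the std2(...) * ones(size(...)) pattern in the MATLAB snippet.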
I did the following
import numpy as np
from skimage import io
io.use_plugin('pil', 'imread')
a = io.imread(r'C:\Users\Dimitrios\Desktop\polimesa\arizona.jpg')
B = np.zeros((len(a)//2 + 1, len(a[0])//2 + 1))
x = []  # values of the current 2x2 block
for i in range(0, len(a), 2):
    for j in range(0, len(a[0]), 2):
        x.append(a[i][j])
        if i+1 < len(a):
            x.append(a[i+1][j])
        if j+1 < len(a[0]):
            x.append(a[i][j+1])
        if i+1 < len(a) and j+1 < len(a[0]):
            x.append(a[i+1][j+1])
        B[i//2][j//2] = np.std(x)
        x[:] = []
and I think it's correct. It iterates over the image in steps of 2, collects each pixel's neighbours in a list, and calculates their std.
Edit: later changed to 4x4 blocks.
We can implement blockproc() in python the following way:
def blockproc(im, block_sz, func):
    h, w = im.shape
    m, n = block_sz
    for x in range(0, h, m):
        for y in range(0, w, n):
            block = im[x:x+m, y:y+n]
            block[:,:] = func(block)
    return im
Now, let's apply it to implement contrast enhancement with local histogram equalization, with the low-contrast moon image (of size 512x512) as input and choosing 64x64 blocks:
from skimage import data, exposure
img = data.moon()
img = img / img.max()
m, n = 64, 64
img_eq = blockproc(img.copy(), (m, n), exposure.equalize_hist)
Display the input and output images:
Note that the function does in-place modification to the image, hence a copy of the input image is passed instead.

Python: Finding a trend in a set of numbers

I have a list of numbers in Python, like this:
x = [12, 34, 29, 38, 34, 51, 29, 34, 47, 34, 55, 94, 68, 81]
What's the best way to find the trend in these numbers? I'm not interested in predicting what the next number will be, I just want to output the trend for many sets of numbers so that I can compare the trends.
Edit: By trend, I mean that I'd like a numerical representation of whether the numbers are increasing or decreasing and at what rate. I'm not massively mathematical, so there's probably a proper name for this!
Edit 2: It looks like what I really want is the co-efficient of the linear best fit. What's the best way to get this in Python?
Possibly you mean you want to plot these numbers on a graph and find a straight line through them where the overall distance between the line and the numbers is minimized? This is called a linear regression
def linreg(X, Y):
    """
    return a,b in solution to y = ax + b such that root mean square distance between trend line and original points is minimized
    """
    N = len(X)
    Sx = Sy = Sxx = Syy = Sxy = 0.0
    for x, y in zip(X, Y):
        Sx = Sx + x
        Sy = Sy + y
        Sxx = Sxx + x*x
        Syy = Syy + y*y
        Sxy = Sxy + x*y
    det = Sxx * N - Sx * Sx
    return (Sxy * N - Sy * Sx)/det, (Sxx * Sy - Sx * Sxy)/det
x = [12, 34, 29, 38, 34, 51, 29, 34, 47, 34, 55, 94, 68, 81]
a,b = linreg(range(len(x)),x) # your x,y are switched from standard notation
The trend line is unlikely to pass through your original points, but it will be as close as possible to the original points that a straight line can get. Using the gradient and intercept values of this trend line (a,b) you will be able to extrapolate the line past the end of the array:
extrapolatedtrendline=[a*index + b for index in range(20)] # replace 20 with desired trend length
The link provided by Keith, or the answer from Riaz, might help you get a polynomial fit, but it is always preferable to use a library when one is available. For the problem at hand, numpy provides a polynomial fit function called polyfit, which can fit the data with a polynomial of any degree.
Here is an example using numpy to fit the data in a linear equation of the form y=ax+b
>>> data = [12, 34, 29, 38, 34, 51, 29, 34, 47, 34, 55, 94, 68, 81]
>>> x = np.arange(0,len(data))
>>> y=np.array(data)
>>> z = np.polyfit(x,y,1)
>>> print "{0}x + {1}".format(*z)
4.32527472527x + 17.6
>>>
Similarly, a quadratic fit would be
>>> z = np.polyfit(x,y,2)
>>> print "{0}x^2 + {1}x + {2}".format(*z)
0.311126373626x^2 + 0.280631868132x + 25.6892857143
>>>
Here is one way to get an increasing/decreasing trend:
>>> x = [12, 34, 29, 38, 34, 51, 29, 34, 47, 34, 55, 94, 68, 81]
>>> trend = [b - a for a, b in zip(x, x[1:])]
>>> trend
[22, -5, 9, -4, 17, -22, 5, 13, -13, 21, 39, -26, 13]
In the resulting list trend, trend[0] can be interpreted as the increase from x[0] to x[1], trend[1] would be the increase from x[1] to x[2] etc. Negative values in trend mean that value in x decreased from one index to the next.
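To turn the list of differences into a single number, one option is the average step together with a count of increases. A minimal sketch follows; note that the steps telescope, so the average depends only on the first and last values, which makes it a cruder measure than a least-squares slope.

```python
x = [12, 34, 29, 38, 34, 51, 29, 34, 47, 34, 55, 94, 68, 81]

steps = [b - a for a, b in zip(x, x[1:])]

# The steps telescope, so their sum is simply last value minus first value.
average_step = sum(steps) / len(steps)
rises = sum(1 for s in steps if s > 0)

print(f"average change per step: {average_step:.3f}")
print(f"{rises} of {len(steps)} steps are increases")
```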
You could do a least squares fit of the data.
Using the formula from this page:
y = [12, 34, 29, 38, 34, 51, 29, 34, 47, 34, 55, 94, 68, 81]
N = len(y)
x = range(N)
B = (sum(x[i] * y[i] for i in range(N)) - 1./N*sum(x)*sum(y)) / (sum(x[i]**2 for i in range(N)) - 1./N*sum(x)**2)
A = 1.*sum(y)/N - B * 1.*sum(x)/N
print("%f + %f * x" % (A, B))
Which prints the starting value and delta of the best fit line.
I agree with Keith, I think you're probably looking for a linear least squares fit (if all you want to know is if the numbers are generally increasing or decreasing, and at what rate). The slope of the fit will tell you at what rate they're increasing. If you want a visual representation of a linear least squares fit, try Wolfram Alpha:
http://www.wolframalpha.com/input/?i=linear+fit+%5B12%2C+34%2C+29%2C+38%2C+34%2C+51%2C+29%2C+34%2C+47%2C+34%2C+55%2C+94%2C+68%2C+81%5D
Update: If you want to implement a linear regression in Python, I recommend starting with the explanation at Mathworld:
http://mathworld.wolfram.com/LeastSquaresFitting.html
It's a very straightforward explanation of the algorithm, and it practically writes itself. In particular, you want to pay close attention to equations 16-21, 27, and 28.
Try writing the algorithm yourself, and if you have problems, you should open another question.
You can find the OLS coefficient using numpy:
import numpy as np
y = [12, 34, 29, 38, 34, 51, 29, 34, 47, 34, 55, 94, 68, 81]
x = []
x.append(list(range(len(y)))) # Time variable
x.append([1 for ele in range(len(y))]) # This adds the intercept
y = np.matrix(y).T
x = np.matrix(x).T
betas = ((x.T*x).I*x.T*y)
Results:
>>> betas
matrix([[ 4.32527473], #coefficient on the time variable
[ 17.6 ]]) #coefficient on the intercept
Since the coefficient on the trend variable is positive, observations in your variable are increasing over time.
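The same coefficients can be obtained with plain arrays and np.linalg.lstsq, which avoids np.matrix (a class the NumPy documentation no longer recommends). This is a sketch of an alternative, not the method used in the answer above; the variable names are illustrative.

```python
import numpy as np

y = np.array([12, 34, 29, 38, 34, 51, 29, 34, 47, 34, 55, 94, 68, 81], dtype=float)
t = np.arange(len(y), dtype=float)             # time variable 0, 1, 2, ...

# Design matrix: first column is time, second column of ones is the intercept.
X = np.column_stack([t, np.ones_like(t)])

# Least-squares solution of X @ [slope, intercept] ~= y.
(slope, intercept), *_ = np.linalg.lstsq(X, y, rcond=None)
print(slope, intercept)
```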
You can simply use the scipy library:
import numpy as np
from scipy.stats import linregress
data = [12, 34, 29, 38, 34, 51, 29, 34, 47, 34, 55, 94, 68, 81]
x = np.arange(1,len(data)+1)
y=np.array(data)
res = linregress(x, y)
print(f'Equation: {res[0]:.3f} * t + {res[1]:.3f}, R^2: {res[2] ** 2:.2f} ')
res
Output:
Equation: 4.325 * t + 13.275, R^2: 0.66
LinregressResult(slope=4.325274725274725, intercept=13.274725274725277, rvalue=0.8096297800892154, pvalue=0.0004497809466484867, stderr=0.9051717124425395, intercept_stderr=7.707259409345618)
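The intercept here (13.275) differs from the 17.6 reported in the other answers only because x starts at 1 instead of 0; the slope is unchanged. The sketch below checks this with a small pure-Python least-squares helper (fit_line is a hypothetical name, not part of scipy).

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


data = [12, 34, 29, 38, 34, 51, 29, 34, 47, 34, 55, 94, 68, 81]

slope0, intercept0 = fit_line(range(len(data)), data)          # x = 0..13
slope1, intercept1 = fit_line(range(1, len(data) + 1), data)   # x = 1..14

print(f"x from 0: {slope0:.3f} * t + {intercept0:.3f}")
print(f"x from 1: {slope1:.3f} * t + {intercept1:.3f}")
```

Shifting the x origin by one moves the intercept by exactly one slope, so the two fits describe the same line.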
Compute the beta coefficient.
y = [12, 34, 29, 38, 34, 51, 29, 34, 47, 34, 55, 94, 68, 81]
x = range(1,len(y)+1)
def var(X):
    S = 0.0
    SS = 0.0
    for x in X:
        S += x
        SS += x*x
    xbar = S/float(len(X))
    return (SS - len(X) * xbar * xbar) / (len(X) - 1.0)
def cov(X,Y):
    n = len(X)
    # float() guards against integer division under Python 2
    xbar = sum(X) / float(n)
    ybar = sum(Y) / float(n)
    return sum([(x-xbar)*(y-ybar) for x,y in zip(X,Y)])/(n-1.0)
def beta(x,y):
    return cov(x,y)/var(x)
print(beta(x,y)) # approximately 4.3253, the same slope as the least-squares fits
