Finding correlation coefficient from 2 lists - python

I am working on a project with several functions that operate on a pair of lists of data. I've already separated the lists, and I have defined some functions which I know for certain work correctly, namely a mean function and a standard deviation function. My issue is that when testing my lists I get a correct mean and a correct standard deviation, but an incorrect correlation coefficient. Could my math be off here? I need to find the correlation coefficient with only Python's standard library.
MY CODE:
def correlCo(someList1, someList2):
    # First establish the means and standard deviations for both lists.
    xMean = mean(someList1)
    yMean = mean(someList2)
    xStandDev = standDev(someList1)
    yStandDev = standDev(someList2)
    zList1 = []
    zList2 = []
    # Create 2 new lists taking (a[i] - a's mean) / a's standard deviation
    for x in someList1:
        z1 = (float(x) - xMean) / xStandDev
        zList1.append(z1)
    for y in someList2:
        z2 = (float(y) - yMean) / yStandDev
        zList2.append(z2)
    # Map the lists to float values instead of strings
    zList1 = list(map(float, zList1))
    zList2 = list(map(float, zList2))
    # Multiply each pair of values from the two lists
    zFinal = [a * b for a, b in zip(zList1, zList2)]
    totalZ = 0
    # Take the sum of all the products
    for a in zFinal:
        totalZ += a
    # Finally calculate the correlation coefficient
    r = (1 / (len(someList1) - 1)) * totalZ
    return r
SAMPLE RUN:
I have the lists [1,2,3,4,4,8] and [3,3,4,5,8,9].
I expect the correct answer of r = 0.8848, but get r = 0.203727.
EDIT: Included the mean and standard deviation functions I have made.
def mean(someList):
    total = 0
    for a in someList:
        total += float(a)
    mean = total / len(someList)
    return mean

def standDev(someList):
    newList = []
    sdTotal = 0
    listMean = mean(someList)
    for a in someList:
        newNum = (float(a) - listMean)**2
        newList.append(newNum)
    for z in newList:
        sdTotal += float(z)
    standardDeviation = sdTotal / len(newList)
    return standardDeviation

The Pearson correlation can be calculated with numpy's corrcoef.
import numpy
numpy.corrcoef(list1, list2)[0, 1]
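Note that numpy is not part of the standard library; if the stdlib-only constraint stands, Python 3.10+ also ships statistics.correlation, and on older versions a few lines with statistics.mean and statistics.stdev suffice. A minimal sketch (the function name correl is ours):

from statistics import mean, stdev

def correl(xs, ys):
    # Pearson r via the normalized sum of products of deviations;
    # stdev() is the sample standard deviation (divides by n - 1)
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / ((n - 1) * sx * sy)

print(correl([1, 2, 3, 4, 4, 8], [3, 3, 4, 5, 8, 9]))  # approximately 0.8848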

Pearson Correlation Coefficient
Code (modified)
def mean(someList):
    total = 0
    for a in someList:
        total += float(a)
    mean = total / len(someList)
    return mean

def standDev(someList):
    listMean = mean(someList)
    dev = 0.0
    for i in range(len(someList)):
        dev += (someList[i] - listMean)**2
    dev = dev**(1/2.0)
    return dev

def correlCo(someList1, someList2):
    # First establish the means and standard deviations for both lists.
    xMean = mean(someList1)
    yMean = mean(someList2)
    xStandDev = standDev(someList1)
    yStandDev = standDev(someList2)
    # r numerator
    rNum = 0.0
    for i in range(len(someList1)):
        rNum += (someList1[i] - xMean) * (someList2[i] - yMean)
    # r denominator
    rDen = xStandDev * yStandDev
    r = rNum / rDen
    return r

print(correlCo([1, 2, 3, 4, 4, 8], [3, 3, 4, 5, 8, 9]))
Output
0.884782972876
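Note that this modified standDev deliberately omits the division by n before the square root: it returns sqrt(sum((x - mean)^2)). Since the same factor is dropped for both lists, it cancels in the ratio, so correlCo computes exactly the usual Pearson formula r = sum((x - x̄)(y - ȳ)) / sqrt(sum((x - x̄)^2) * sum((y - ȳ)^2)).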

Normally, according to the standard deviation formula, shouldn't you divide dev by the sample size (the length of the list) before taking the square root? I mean:
dev += ((someList[i]-listMean)**2)/len(someList)

Your standard deviation is wrong: you forgot to take the square root. You are actually returning the variance, not the standard deviation, from that function.
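A minimal sketch of a fix along those lines: return the sample standard deviation (divide by n - 1, then take the square root), which is what the (n - 1) normalization in the asker's correlCo expects:

import math

def standDev(someList):
    # sample standard deviation: sqrt(sum of squared deviations / (n - 1))
    listMean = mean(someList)
    sdTotal = sum((float(a) - listMean)**2 for a in someList)
    return math.sqrt(sdTotal / (len(someList) - 1))

With this version the original correlCo returns approximately 0.8848 for the sample lists.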

Related

How to speed up an N dimensional interval tree in python?

Consider the following problem: Given a set of n intervals and a set of m floating-point numbers, determine, for each floating-point number, the subset of intervals that contain the floating-point number.
This problem has been addressed by constructing an interval tree (also called a range tree or segment tree). Implementations exist for the one-dimensional case, e.g. python's intervaltree package. Usually these implementations consider only one or a few floating-point numbers, i.e. a small "m" above.
In my problem setting, both n and m are extremely large numbers (they come from an image processing problem). Further, I need to consider N-dimensional intervals (called cuboids when N=3; I was modeling human brains with the Finite Element Method). I have implemented a simple N-dimensional interval tree in python, but it runs in a loop and can only take one floating-point number at a time. Can anyone help improve the implementation in terms of efficiency? You can change the data structure freely.
import sys
import time
import numpy as np

# find the indices i satisfying a[i] < x in one dimension
def find_index_smaller(a, x):
    idx = np.argsort(a)
    ss = np.searchsorted(a, x, sorter=idx)
    res = idx[0:ss]
    return res

# find the indices i satisfying a[i] > x in one dimension
def find_index_larger(a, x):
    return find_index_smaller(-a, -x)

# find the indices satisfying amin < x < amax in one dimension
def find_intv_at(amin, amax, x):
    idx = find_index_smaller(amin, x)
    idx2 = find_index_larger(amax[idx], x)
    res = idx[idx2]
    return res

# find the indices satisfying amin < x < amax in N dimensions
def find_intv_at_nd(amin, amax, x):
    dim = amin.shape[0]
    res = np.arange(amin.shape[-1])
    for i in range(dim):
        idx = find_intv_at(amin[i, res], amax[i, res], x[i])
        res = res[idx]
    return res
I also have two test examples for sanity check and performance testing:
def demo1():
    print("By default, we do a correctness test")
    n_intv = 2
    n_point = 2
    # generate the test data
    point = np.random.rand(3, n_point)
    intv_min = np.random.rand(3, n_intv)
    intv_max = intv_min + np.random.rand(3, n_intv) * 8
    print("point")
    print(point)
    print("intv_min")
    print(intv_min)
    print("intv_max")
    print(intv_max)
    print("===Indexes of intervals that contain the point===")
    for i in range(n_point):
        print(find_intv_at_nd(intv_min, intv_max, point[:, i]))

def demo2():
    print("Performance:")
    n_points = 100
    n_intv = 1000000
    # generate the test data
    points = np.random.rand(n_points, 3) * 512
    intv_min = np.random.rand(3, n_intv) * 512
    intv_max = intv_min + np.random.rand(3, n_intv) * 8
    print("point.shape = " + str(points.shape))
    print("intv_min.shape = " + str(intv_min.shape))
    print("intv_max.shape = " + str(intv_max.shape))
    starttime = time.time()
    for point in points:
        tmp = find_intv_at_nd(intv_min, intv_max, point)
    print("it took this long to run {} points, with {} intervals: {}".format(n_points, n_intv, time.time() - starttime))
My ideas would be:
Remove np.argsort() from the algorithm: the interval tree does not change, so the sorting could be done once in pre-processing.
Vectorize x: the algorithm currently runs a loop for each x, and it would be nice to get rid of that loop (see the sketch after this question).
Any contribution would be appreciated.
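As a baseline for idea 2, here is a hedged sketch (our own, not the asker's tree): a fully vectorized brute-force containment test that drops the per-query argsort and the per-dimension filtering, and handles many query points at once via broadcasting. Note the (dim, m, n) boolean array can get memory-hungry for very large m*n:

import numpy as np

def find_intv_at_nd_vec(amin, amax, x):
    # amin, amax: shape (dim, n); x: shape (dim,)
    inside = (amin < x[:, None]) & (x[:, None] < amax)  # shape (dim, n)
    return np.nonzero(inside.all(axis=0))[0]            # indices of containing intervals

def find_intv_at_nd_many(amin, amax, xs):
    # xs: shape (dim, m); returns a boolean matrix of shape (m, n),
    # row i flagging the intervals that contain point xs[:, i]
    inside = (amin[:, None, :] < xs[:, :, None]) & (xs[:, :, None] < amax[:, None, :])
    return inside.all(axis=0)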

How to detect outliers "among neighbouring values" of univariate data using standard python

I have a simple data-set that has 2 columns date/temperature.
I need some method to find outliers:
1. In the temperature column
2. Only w.r.t. some neighbouring values
What will be the best way to detect outliers w.r.t. neighbouring values only?
I have tried coding up IQR, mean/median deviation, etc., but they prune out more values than I want. I tried applying those methods to 10 values at a time, but that ends up finding even more outliers than I expect.
Also I need to finally code it in standard python, so any hints on that will be useful.
Below are the functions from my class DataChecker that calculate the IQR and use it to check for outliers.
from datetime import date
from statistics import median

class DataChecker:
    class DateTemperature:
        def __init__(self, input_date, temperature):
            try:
                day, month, year = input_date.split('/')
                self._temperature_date = date(int(year), int(month), int(day))
            except ValueError:
                # Don't tolerate an invalid date
                raise  # TODO change to custom error
            try:
                self._temperature = float(temperature)
            except (TypeError, ValueError):
                self._temperature = 0

        @property
        def date(self):
            return self._temperature_date.strftime('%d/%m/%Y')

        @property
        def temperature(self):
            return self._temperature

    def __init__(self, input_date_temperature_values):
        self._date_temperature_values = []
        for date, temperature in input_date_temperature_values:
            try:
                self._date_temperature_values.append(self.DateTemperature(date, temperature))
            except ValueError:
                pass
        self._date_temperature_values.sort(key=lambda x: x.date)
        self._outlier_low, self._outlier_high = self._calculate_outlier_thresholds(self._date_temperature_values)

    def _is_value_outlier(self, temperature):
        if temperature < self._outlier_low or temperature > self._outlier_high:
            return True
        return False

    def _calculate_outlier_thresholds(self, data_temperature_values):
        temperature_values = sorted([dataTemperature.temperature for dataTemperature in data_temperature_values])
        median_index = len(temperature_values) // 2
        first_quartile = median(temperature_values[:median_index])
        third_quartile = median(temperature_values[median_index+1:])
        iqr = third_quartile - first_quartile
        # Tried with 1.5, 1.2, 2 etc.
        low_iqr = first_quartile - 1.2 * iqr
        high_iqr = third_quartile + 1.2 * iqr
        # Trying mean/median deviation
        #mean_value = statistics.median(temperature_values)
        #std_dev = statistics.pstdev(temperature_values)
        #print(f'{mean_value} : {std_dev}')
        #low_iqr = mean_value - 2*std_dev
        #high_iqr = mean_value + 2*std_dev
        #print(low_iqr, ':', high_iqr)
        return low_iqr, high_iqr
Thanks!
Standard Deviation: The standard deviation is a measure that is used to quantify the amount of variation or dispersion of a set of data values.
Mean: Mean is the sum of the sampled values divided by the number of items.
import numpy as np

def removeOutlier(data):
    data = np.array(data)
    mean = np.mean(data, axis=0)
    std = np.std(data, axis=0)
    final_list = [x for x in data if x > mean - 2 * std]
    final_list = [x for x in final_list if x < mean + 2 * std]
    return final_list
You can select the neighbouring points, put them in a list, and pass that list through this function to get an outlier-free list.
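A hedged stdlib-only sketch of that neighbour-window idea (the window size and threshold are assumptions to tune against your data): flag a temperature as an outlier when it deviates from the median of its surrounding window by more than the threshold.

from statistics import median

def rolling_outliers(values, window=5, threshold=5.0):
    # hypothetical helper: returns indices of values that deviate from
    # the median of their neighbours by more than the threshold
    half = window // 2
    outliers = []
    for i, v in enumerate(values):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        neighbours = values[lo:i] + values[i+1:hi]  # exclude the value itself
        if neighbours and abs(v - median(neighbours)) > threshold:
            outliers.append(i)
    return outliers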

Calculating the Standard Deviation of a text file

I'm trying to calculate the standard deviation of all the data that's in the "ClosePrices" column; see the pastebin https://pastebin.com/JtGr672m
We need to calculate one Standard Deviation of all the 1029 floats.
This is my code:
ins1 = open("bijlage.txt", "r")
for line in ins1:
    numbers = [(n) for n in number_strings]
    i = i + 1
ClosePriceSD = []
ClosePrice = float(data[0][5].replace(',', '.'))
ClosePriceSD.append(ClosePrice)
import math

def sd_calc(data):
    n = 1029
    if n <= 1:
        return 0.0
    mean, sd = avg_calc(data), 0.0
    # calculate stan. dev.
    for el in data:
        sd += (float(el) - mean)**2
    sd = math.sqrt(sd / float(n - 1))
    return sd

def avg_calc(ls):
    n, mean = len(ls), 0.0
    if n <= 1:
        return ls[0]
    # calculate average
    for el in ls:
        mean = mean + float(el)
    mean = mean / float(n)
    return mean
print("Standard Deviation:")
print(sd_calc(ClosePriceSD))
print()
So what I'm trying to calculate is the standard deviation of all the floats under the "ClosePrices" part.
I have the line ClosePrice = float(data[0][5].replace(',', '.')), which should gather all the floats under ClosePrice, but it only picks up data[0][5]. I want to calculate one standard deviation from all the 1029 floats under ClosePrice.
I think your error is in the for loop at the beginning. You have for line in ins1, but you never use line inside the loop. In the loop you also use number_strings and data, which are not defined before.
Here is how you can extract the data from your txt file.
with open("bijlage.txt", "r") as ff:
    ll = ff.readlines()  # extract a list; each element is one line of the file
data = []
for line in ll[1:]:  # excluding the first line, which is a header
    d = line.split(';')[5]  # split each line on semicolons and keep the element with index 5
    data.append(float(d.replace(',', '.')))  # substitute the comma with a dot and convert to float
print(data)  # data is a list with all the numbers you want
You should be able to calculate mean and standard deviation from here.
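For completeness, once data is built, the standard-library statistics module can finish the job (a sketch; statistics.stdev is the sample standard deviation, matching the n - 1 divisor in sd_calc, while pstdev is the population version):

from statistics import mean, stdev

print("Mean:", mean(data))
print("Standard Deviation:", stdev(data))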
You didn't really specify what the issue/error is. Although this probably doesn't help if it is a school project, you could install scipy, which has a standard deviation function. In this case, just put your array in as a parameter. Could you elaborate on what you're having trouble with? Is the current code giving an error?
Edit:
Looking at the data, you want the 6th element in each line (ClosePrice). If your function is working, and all you need is an array of the ClosePrices, this is what I would suggest.
import math

data = []
ins1 = open("bijlage.txt", "r")
lines = [line.rstrip('\n') for line in ins1]
for line in lines[1:]:  # skip the header line
    fields = line.split(';')
    data.append(float(fields[5].replace(',', '.')))

def sd_calc(data):
    n = len(data)
    if n <= 1:
        return 0.0
    mean, sd = avg_calc(data), 0.0
    # calculate stan. dev.
    for el in data:
        sd += (float(el) - mean)**2
    sd = math.sqrt(sd / float(n - 1))
    return sd

def avg_calc(ls):
    n, mean = len(ls), 0.0
    if n <= 1:
        return ls[0]
    # calculate average
    for el in ls:
        mean = mean + float(el)
    mean = mean / float(n)
    return mean

print("Standard Deviation:")
print(sd_calc(data))
print()

(Python) Markov, Chebyshev, Chernoff upper bound functions

I'm stuck with one task on my learning path.
For the binomial distribution X ~ B(p, n) with mean μ = np and variance σ² = np(1−p), we would like to upper bound the probability P(X ≥ c·μ) for c ≥ 1.
Three bounds introduced:
Markov: P(X ≥ c·μ) ≤ 1/c
Chebyshev: P(X ≥ μ + kσ) ≤ 1/k²
Chernoff: P(X ≥ (1+δ)·μ) ≤ exp(−δ²·μ/(2+δ))
The task is to write three functions respectively for each of the inequalities. They must take n , p and c as inputs and return the upper bounds for P(X≥c⋅np) given by the above Markov, Chebyshev, and Chernoff inequalities as outputs.
And there is an example of IO:
Code:
print Markov(100.,0.2,1.5)
print Chebyshev(100.,0.2,1.5)
print Chernoff(100.,0.2,1.5)
Output
0.6666666666666666
0.16
0.1353352832366127
I'm completely stuck. I just can't figure out how to plug in all that math into functions (or how to think algorithmically here). If someone could help me out, that would be of great help!
P.S. No libraries are allowed by the task conditions, except math.exp.
Ok, let's look at what's given:
Input and derived values:
n = 100
p = 0.2
c = 1.5
m = n*p = 100 * 0.2 = 20
s2 = n*p*(1-p) = 16
s = sqrt(s2) = sqrt(16) = 4
You have multiple inequalities of the form P(X>=a*m) and you need to provide bounds for the term P(X>=c*m), so you need to think how a relates to c in all cases.
Markov inequality: P(X >= a*m) <= 1/a
You're asked to implement Markov(n, p, c), which returns the upper bound for P(X >= c*m). Matching P(X >= a*m) with P(X >= c*m), it's clear that a == c, so the bound is 1/a = 1/c. That's just
def Markov(n, p, c):
    return 1.0/c
>>> Markov(100,0.2,1.5)
0.6666666666666666
That was easy, wasn't it?
Chernoff inequality states that P(X>=(1+d)*m) <= exp(-d**2/(2+d)*m)
First, let's verify that if
P(X >= (1+d)*m) = P(X >= c*m)
then
1 + d = c
d = c - 1
This gives us everything we need to calculate the upper bound:
def Chernoff(n, p, c):
    d = c - 1
    m = n*p
    return math.exp(-d**2/(2+d)*m)
>>> Chernoff(100,0.2,1.5)
0.1353352832366127
Chebyshev inequality bounds P(X>=m+k*s) by 1/k**2
So again, if
P(X >= c*m) = P(X >= m + k*s)
then
c*m = m + k*s
m*(c-1) = k*s
k = m*(c-1)/s
Then it's straightforward to implement:
def Chebyshev(n, p, c):
    m = n*p
    s = math.sqrt(n*p*(1-p))
    k = m*(c-1)/s
    return 1/k**2
>>> Chebyshev(100,0.2,1.5)
0.16
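For convenience, here are the three functions consolidated into one runnable script, assembled from the pieces above; only math is imported, in line with the task constraints:

import math

def Markov(n, p, c):
    return 1.0 / c

def Chebyshev(n, p, c):
    m = n * p
    s = math.sqrt(n * p * (1 - p))
    k = m * (c - 1) / s
    return 1 / k**2

def Chernoff(n, p, c):
    d = c - 1
    m = n * p
    return math.exp(-d**2 / (2 + d) * m)

print(Markov(100., 0.2, 1.5))     # 0.6666666666666666
print(Chebyshev(100., 0.2, 1.5))  # 0.16
print(Chernoff(100., 0.2, 1.5))   # 0.1353352832366127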

Standard deviation of combinations of dice

I am trying to find the standard deviation of a sequence of numbers extracted from the combinations of 30 dice that sum up to 120. I am very new to Python, and this code makes the console freeze because the number of combinations is enormous; I am not sure how to fit them all into a smaller, more efficient function. What I did is:
found all possible combinations of 30 dice;
filtered combinations that sum up to 120;
multiplied together the items within each combination in the result list;
tried extracting standard deviation.
Here is the code:
import itertools
import numpy

dice = [1, 2, 3, 4, 5, 6]
subset = itertools.product(dice, repeat=30)
result = []
for x in subset:
    if sum(x) == 120:
        result.append(x)
my_result = numpy.product(result, axis=1).tolist()
std = numpy.std(my_result)
print(std)
Note that Var(X) = E(X^2) - E(X)^2, so you can solve this problem analytically with the following recurrences, where h[i][N] counts the combinations of i dice summing to N, f[i][N] is the sum of the face products over those combinations, and g[i][N] is the sum of the squared products:
f[i][N] = sum(k * f[i-1][N-k])     (1 <= k <= 6)
g[i][N] = sum(k^2 * g[i-1][N-k])   (1 <= k <= 6)
h[i][N] = sum(h[i-1][N-k])         (1 <= k <= 6)
f[1][k] = k      (1 <= k <= 6)
g[1][k] = k^2    (1 <= k <= 6)
h[1][k] = 1      (1 <= k <= 6)
Sample implementation:
import numpy as np

Nmax = 120
nmax = 30
min_value = 1
max_value = 6
# the intermediate results get really huge; to keep them exact we store Python big-ints
f = np.zeros((nmax+1, Nmax+1), dtype='object')
g = np.zeros((nmax+1, Nmax+1), dtype='object')
h = np.zeros((nmax+1, Nmax+1), dtype='object')
for i in range(min_value, max_value+1):
    f[1][i] = i
    g[1][i] = i**2
    h[1][i] = 1
for i in range(2, nmax+1):
    for N in range(1, Nmax+1):
        f[i][N] = 0
        g[i][N] = 0
        h[i][N] = 0
        for k in range(min_value, max_value+1):
            if N - k >= 1:  # guard against negative indices wrapping around
                f[i][N] += k * f[i-1][N-k]
                g[i][N] += (k**2) * g[i-1][N-k]
                h[i][N] += h[i-1][N-k]
result = np.sqrt(float(g[nmax][Nmax]) / h[nmax][Nmax] - (float(f[nmax][Nmax]) / h[nmax][Nmax]) ** 2)
# result = 32128174994365296.0
You ask for a result over the unfiltered 6^30 ≈ 2.2·10^23 combinations, which is impossible to handle as such.
There are two possibilities that can be combined:
Include more thinking to pre-treat the problem, e.g. figure out how to sample only those combinations with sum 120.
Do a Monte Carlo simulation instead, i.e. don't enumerate all combinations, but only a random couple of thousand, to obtain a representative sample that determines the std sufficiently accurately.
Now, I only apply (2), giving the brute force code:
import random

N = 30      # number of dice
M = 100000  # number of samples
S = 120     # required sum
result = [[random.randint(1, 6) for _ in range(N)] for _ in range(M)]
result = [s for s in result if sum(s) == S]
Now, that result should be comparable to your result before using numpy.product ... that part I couldn't follow, though...
OK, if you are after the standard deviation of the product of the 30 dice, that is what your code does. Then I need 1,000,000 samples to get roughly reproducible values for the std (1 digit); that takes my PC about 20 seconds, still considerably less than 1 million years :-D.
Is a number like 3.22·10^16 what you are looking for?
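A sketch of that Monte Carlo estimate end to end (M is the 1,000,000 samples mentioned above; the products stay within int64 range because, under the sum-120 constraint, each is at most 4**30):

import random
import numpy

N, M, S = 30, 1000000, 120
samples = [[random.randint(1, 6) for _ in range(N)] for _ in range(M)]
kept = [s for s in samples if sum(s) == S]  # only combinations summing to 120
products = numpy.product(kept, axis=1)
print(numpy.std(products))  # roughly 3.2e16, with sample-to-sample variation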
Edit after comments:
Well, sampling the frequency of numbers instead gives only 6 independent variables - even 4 actually, by substituting in the constraints (sum = 120, total number = 30). My current code looks like this:
import itertools
import numpy

def p2(b, s):
    return 2**b * 3**s[0] * 4**s[1] * 5**s[2] * 6**s[3]

hits = range(31)
subset = itertools.product(hits, repeat=4)  # only the frequencies of faces 3, 4, 5, 6
product = []
permutations = []
for s in subset:
    b = 90 - (2*s[0] + 3*s[1] + 4*s[2] + 5*s[3])  # frequency of face 2
    a = 30 - (b + sum(s))                         # frequency of face 1
    if 0 <= b <= 30 and 0 <= a <= 30:
        product.append(p2(b, s))
        permutations.append(1)  # TODO: Replace 1 with possible permutations
print(numpy.std(product))  # TODO: calculate std manually, considering permutations
This computes in about 1 second, but the confusing part is that I get 1.28737023733e+17 as a result. Either my previous approaches or this one has a bug, or both.
Sorry, not that easy: the samples do not all have the same probability; that is the problem here. Each frequency combination corresponds to a different number of possible orderings, giving it a weight that has to be considered before taking the std deviation. I have drafted that in the code above.
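For what it's worth, here is a hedged sketch completing those TODOs (an assumption about the intended fix): weight each frequency combination by its multinomial coefficient, i.e. the number of distinct orderings of the 30 dice with those face frequencies, and take a weighted standard deviation:

import itertools
import math
import numpy

def p2(b, s):
    return 2**b * 3**s[0] * 4**s[1] * 5**s[2] * 6**s[3]

def multinomial(counts):
    # number of distinct orderings of 30 dice with the given face frequencies
    res = math.factorial(sum(counts))
    for c in counts:
        res //= math.factorial(c)
    return res

products = []
weights = []
for s in itertools.product(range(31), repeat=4):  # frequencies of faces 3, 4, 5, 6
    b = 90 - (2*s[0] + 3*s[1] + 4*s[2] + 5*s[3])  # frequency of face 2
    a = 30 - (b + sum(s))                         # frequency of face 1
    if 0 <= b <= 30 and 0 <= a <= 30:
        products.append(float(p2(b, s)))               # floats avoid int64 overflow in numpy
        weights.append(float(multinomial((a, b) + s)))

m = numpy.average(products, weights=weights)
var = numpy.average((numpy.array(products) - m)**2, weights=weights)
print(math.sqrt(var))  # should agree with the analytic result above, about 3.21e16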
