Calculating the standard deviation of a text file - Python

I'm trying to calculate the standard deviation of all the data in the "ClosePrices" column; see the pastebin https://pastebin.com/JtGr672m
I need to calculate a single standard deviation over all 1029 floats.
This is my code:
ins1 = open("bijlage.txt", "r")
for line in ins1:
    numbers = [(n) for n in number_strings]
    i = i + 1
    ClosePriceSD = []
    ClosePrice = float(data[0][5].replace(',', '.'))
    ClosePriceSD.append(ClosePrice)
def sd_calc(data):
    n = 1029
    if n <= 1:
        return 0.0
    mean, sd = avg_calc(data), 0.0
    # calculate stan. dev.
    for el in data:
        sd += (float(el) - mean)**2
    sd = math.sqrt(sd / float(n-1))
    return sd
def avg_calc(ls):
    n, mean = len(ls), 0.0
    if n <= 1:
        return ls[0]
    # calculate average
    for el in ls:
        mean = mean + float(el)
    mean = mean / float(n)
    return mean
print("Standard Deviation:")
print(sd_calc(ClosePriceSD))
print()
So what I'm trying to calculate is the standard deviation of all the floats in the "ClosePrices" column.
I have the line "ClosePrice = float(data[0][5].replace(',', '.'))", which I expected to pick up every float under ClosePrice, but it only reads data[0][5], so the standard deviation is calculated from that single value. I want one standard deviation calculated from all 1029 floats under ClosePrice.

I think your error is in the for loop at the beginning. You have for line in ins1, but you never use line inside the loop. The loop also uses number_strings and data, which are not defined anywhere before it.
Here is how you can extract the data from your txt file.
with open("bijlage.txt", "r") as ff:
    ll = ff.readlines()  # extract a list; each element is a line of the file
data = []
for line in ll[1:]:  # skip the first line, which is a header
    d = line.split(';')[5]  # split the line on semicolons and keep the element with index 5
    data.append(float(d.replace(',', '.')))  # replace the decimal comma with a dot and convert to a float
print(data)  # data is a list with all the numbers you want
You should be able to calculate mean and standard deviation from here.
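As a rough sketch of that last step, the parsed list can be handed to the standard library's statistics module. The in-memory sample below stands in for bijlage.txt; its header names and row values are made up for illustration, and only the layout (header line, semicolon separators, decimal commas, ClosePrice at index 5) follows the question.

```python
import io
import statistics

# Hypothetical stand-in for bijlage.txt: the column names and values are
# invented; only the layout mirrors what the question describes.
sample = io.StringIO(
    "Date;Open;High;Low;Volume;ClosePrice\n"
    "d1;0;0;0;0;10,5\n"
    "d2;0;0;0;0;11,5\n"
    "d3;0;0;0;0;12,5\n"
)

def read_close_prices(handle):
    """Return the values at index 5 of every non-header line as floats."""
    lines = handle.readlines()
    prices = []
    for line in lines[1:]:  # skip the header line
        field = line.rstrip("\n").split(";")[5]
        prices.append(float(field.replace(",", ".")))  # decimal comma to dot
    return prices

prices = read_close_prices(sample)
mean = statistics.mean(prices)
sample_sd = statistics.stdev(prices)  # sample form, divides by n - 1
print(mean, sample_sd)
```

With the real file, passing the handle from open("bijlage.txt", "r") to read_close_prices should behave the same way, provided every data line has at least six fields.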

You didn't really specify what the issue or error is. This probably doesn't help if it is a school project, but you could install scipy, which has standard deviation functions; you just pass your array in as a parameter. Could you elaborate on what you're having trouble with? Is the current code giving an error?
Edit:
Looking at the data, you want the 6th element in each line (ClosePrice). If your function is working and all you need is an array of the ClosePrice values, this is what I would suggest.
import math
data = []
with open("bijlage.txt", "r") as ins1:
    lines = [line.rstrip('\n') for line in ins1]
for line in lines[1:]:  # skip the header line
    fields = line.split(';')
    # ClosePrice is the 6th field; the file uses a decimal comma
    data.append(float(fields[5].replace(',', '.')))
def sd_calc(data):
    n = len(data)  # use the actual number of values instead of a hardcoded 1029
    if n <= 1:
        return 0.0
    mean, sd = avg_calc(data), 0.0
    # calculate stan. dev.
    for el in data:
        sd += (float(el) - mean)**2
    sd = math.sqrt(sd / float(n-1))
    return sd
def avg_calc(ls):
    n, mean = len(ls), 0.0
    if n <= 1:
        return ls[0]
    # calculate average
    for el in ls:
        mean = mean + float(el)
    mean = mean / float(n)
    return mean
print("Standard Deviation:")
print(sd_calc(data))
print()

Related

I want to calculate mean and standard deviation from user input, but I get a data type error in the standard deviation function

I want to create a program that asks for input until an empty line is entered. The user input is stored in a list, and the program then calculates the mean and standard deviation of the inputs. I wrote the following code, but it shows a data type error for the stdev function.
def main():
    print("Enter the data, one value per line.\nEnd by entering empty line.")
    a = []
    prompt = ""
    line = input(prompt)
    while line:
        a.append(float(line))
        line = input(prompt)
    meanfunction(a)
    stdev(a)
    print("The mean of given data was: ",meanfunction(a))
    print("The standard deviation of given data was: ",stdev(a))
def meanfunction(data):
    average = sum(data) / len(data)
    average_f = "{:.2f}".format(average)
    return average_f
def variance(data):
    n = len(data)
    mean = sum(data) / n
    deviations = [(x - mean) ** 2 for x in data]
    variance = sum(deviations) / (n)
    variance_f = "{:.2f}".format(variance)
    return variance_f
def stdev(data):
    import math
    var = variance(data)
    std_dev = math.sqrt(var)
    return std_dev
if __name__ == "__main__":
    main()
Having fixed your indentation, the issue seems to be the format string in variance(data) just before the return line. You use the output of variance as an input to the stdev function, but variance returns a string. It looks like meanfunction does the same thing.
Generally, it is best to have these mathematical functions stick to what they are supposed to do: return a number, as you already do with stdev's return value. Deal with making it pretty when it comes to actually printing it to the screen.
Also, making variable names more descriptive than "a" is nice, especially when we come back to look at our old code! Lastly, we usually want to put imports at the very top.
import math
def main():
    print("Enter the data, one value per line.\n"
          "End by entering an empty line.")
    user_values = []
    prompt = ""
    line = input(prompt)
    while line:
        user_values.append(float(line))
        line = input(prompt)
    meanfunction(user_values)
    stdev(user_values)
    print(f"The mean of the given data was: {meanfunction(user_values):.2f} ")
    print(f"The standard deviation of the given data was: {stdev(user_values):.2f}")
def meanfunction(data):
    average = sum(data) / len(data)
    return average
def variance(data):
    n = len(data)
    mean = sum(data) / n
    deviations = [(x - mean) ** 2 for x in data]
    variance = sum(deviations) / (n - 1)
    return variance
def stdev(data):
    var = variance(data)
    std_dev = math.sqrt(var)
    return std_dev
if __name__ == "__main__":
    main()

Calculating the standard deviation from numbers in python

I'm trying to calculate the standard deviation from a bunch of numbers in a document.
Here's what I got so far:
with open("\\Users\\xxx\\python_courses\\1DV501\\assign3\\file_10000integers_B.txt", "r") as f:
    total2 = 0
    number_of_ints2 = 0
    deviation = 0.0
    variance = 0.0
    for line in f:
        for num in line.split(':'):
            total2 += int(num)
            number_of_ints2 += 1
    average = total2/number_of_ints2
    for line in f:
        for num in line.split(":"):
            devation += [(int(num) - average) **2
But I'm completely stuck. I don't know how to do it. Math is not my strong suit, so this is turning out to be quite difficult.
Also, the document contains a mix of negative and positive numbers, if that makes any difference.
You can use one of several available libraries. For example, suppose I had some data from somewhere:
>>> import random
>>> data = [random.randint(1,100) for _ in range(100)] # assume from your txt file
I could use statistics.stdev
>>> import statistics
>>> statistics.stdev(data)
28.453646514989956
or numpy.std
>>> import numpy as np
>>> np.std(data)
28.311020822287563
or scipy.stats.tstd
>>> import scipy.stats
>>> scipy.stats.tstd(data)
28.453646514989956
or if you want to roll your own
import math
def stddev(data):
    mean = sum(data) / len(data)
    return math.sqrt((1/len(data)) * sum((i-mean)**2 for i in data))
>>> stddev(data)
28.311020822287563
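Note that the slight difference in the computed values depends on whether you want the "sample" standard deviation (dividing by n - 1, as statistics.stdev and scipy.stats.tstd do) or the "population" standard deviation (dividing by n, as numpy.std does by default and as the hand-rolled stddev does).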
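To make the two conventions concrete, here is a small sketch that computes both by hand and checks them against the standard library. The data values are made up for illustration; statistics.pstdev and statistics.stdev are the population and sample functions respectively.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values; their mean is 5
n = len(data)
mean = sum(data) / n
squared_deviations = sum((x - mean) ** 2 for x in data)

population_sd = math.sqrt(squared_deviations / n)    # divide by n
sample_sd = math.sqrt(squared_deviations / (n - 1))  # divide by n - 1

# The standard library offers both conventions directly.
assert math.isclose(population_sd, statistics.pstdev(data))
assert math.isclose(sample_sd, statistics.stdev(data))
print(population_sd, sample_sd)
```

For numpy, passing ddof=1 to numpy.std switches it from the population form to the sample form.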
You can use the stdev function from the statistics module in the standard library.
Put your numbers in a list, then apply the function:
from statistics import stdev
mylist = [1,2,5,10,100]
std = stdev(mylist)
The problem is that you are iterating over the file twice, and you didn't reset the reader to the beginning of the file before the second loop. You can use f.seek(0) to do this.
total2 = 0
number_of_ints2 = 0
deviation = 0.0
variance = 0.0
with open("numbers.txt", "r") as f:
    for line in f:
        for num in line.split(':'):
            total2 += int(num)
            number_of_ints2 += 1
    average = total2 / number_of_ints2
    f.seek(0) # Move back to the beginning of the file.
    for line in f:
        for num in line.split(":"):
            deviation += (int(num) - average) ** 2
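The loop above stops at the sum of squared deviations, so two more steps are needed to reach the standard deviation. The sketch below finishes the calculation under the assumption that the population form (dividing by the count) is wanted; an in-memory io.StringIO with made-up numbers stands in for the file.

```python
import io
import math

# Stand-in for the file: colon-separated integers, values invented here.
f = io.StringIO("1:2:3\n-4:5\n")

total = 0
count = 0
for line in f:
    for num in line.split(":"):
        total += int(num)  # int() tolerates the trailing newline
        count += 1
average = total / count

f.seek(0)  # rewind before the second pass
squared_deviations = 0.0
for line in f:
    for num in line.split(":"):
        squared_deviations += (int(num) - average) ** 2

# Assumption: population variance. Divide by count - 1 for the sample form.
variance = squared_deviations / count
std_dev = math.sqrt(variance)
print(average, std_dev)
```

Negative numbers need no special handling, because each deviation is squared before it is summed.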

multiple files, multiprocessing and sliding window

I have a list of files (sometimes just one), and I want to process every n lines from each file in parallel with a sliding window.
I do not want to parallelize across files, since in some cases there could be fewer files than cores available.
Next, I want to store the output of every sliding window in a list (for random values). I have done this so far and it works, but there are issues.
It would be great if I could replace pool.map with something that allows multiple parameters for my function (file, sliding window size and j).
I tried pool.apply_async, but when I call pool.join it takes too long, so I guess something is wrong. As a next step, I would like to compare the output values (mean and std) by iterating through all files with a sliding window: all_segments.
In short:
For 10 random values, iterate through the file with the selected window.
Calculate the mean for each window and store the output in a list. For that list, calculate the mean of the means and the standard deviation.
For every sliding window in the files, calculate the mean and a z-score against the mean and standard deviation obtained in the previous step.
def random_segments(j):
    cov_list=[]
    cov = []
    lines = list(islice(f, j, j+1000))
    for line in lines:
        cov.append(float(line.split("\t")[2]))
    mc1 = sum(cov)/len(cov)
    cov_list.append(mc1)
    return cov_list
def all_segments(j):
    cov_list=[]
    cov = []
    lines = list(islice(f, j, j+1000))
    for line in lines:
        cov.append(float(line.split("\t")[2]))
    mc2 = sum(cov)/len(cov)
    z = (mc2 - mean) / sd
    print (z)
    if z > 10 or z < -10:
        print (line)
if __name__ == '__main__':
    for cv_file in os.listdir("."):
        if cv_file.endswith(".coverage.out"):
            f = open(cv_file, 'r').readlines()
            if args.ws == False:
                args.ws = 1000
            size = len(f)
            print (cv_file + "\t" + str(size))
            perc= float(args.rn)/100 * int(size)
            perc = perc // 1
            print(perc)
            pool=mp.Pool(int(args.proc))
            rn=[random.randint(1,int(size)-args.ws) for _ in range(10)]
            data = pool.map(random_segments, [i for i in rn])
            data = [ent for sublist in data for ent in sublist]
            sd, variance, mean = mean_std(data)
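The three steps in the short description can be sketched sequentially before any parallelism is added. This is only an illustration under assumptions: the function names, the step of one line between windows, and the guard for a zero standard deviation are not from the question, and the coverage values are assumed to be parsed into a list of floats already.

```python
import random
import statistics

def window_mean(values, start, size):
    """Mean of values[start:start + size]."""
    window = values[start:start + size]
    return sum(window) / len(window)

def flag_outlier_windows(values, size, n_random=10, threshold=10.0, seed=0):
    """Return (start, z) for windows whose z-score exceeds the threshold."""
    rng = random.Random(seed)
    # Steps 1 and 2: means of randomly placed windows, then their mean and sd.
    starts = [rng.randint(0, len(values) - size) for _ in range(n_random)]
    baseline = [window_mean(values, s, size) for s in starts]
    mean = statistics.mean(baseline)
    sd = statistics.stdev(baseline)
    if sd == 0:
        return []  # a z-score is undefined when the baseline has no spread
    # Step 3: z-score of every window (step of 1, an assumption) against it.
    flagged = []
    for start in range(len(values) - size + 1):
        z = (window_mean(values, start, size) - mean) / sd
        if abs(z) > threshold:
            flagged.append((start, z))
    return flagged
```

Because window_mean takes all of its inputs as arguments instead of reading a global f, it could be dispatched with multiprocessing's Pool.starmap, which accepts an iterable of argument tuples such as [(values, s, size) for s in starts]. That is one way to pass several parameters to the worker function.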

Finding correlation coefficient from 2 lists

I am working on a project with many functions that operate on a couple of lists of data. I've already separated the lists, and I have defined some functions that I know for certain work correctly: a mean function and a standard deviation function. My issue is that when testing my lists I get a correct mean and a correct standard deviation, but an incorrect correlation coefficient. Could my math be off here? I need to find the correlation coefficient with only Python's standard library.
MY CODE:
def correlCo(someList1, someList2):
    # First establish the means and standard deviations for both lists.
    xMean = mean(someList1)
    yMean = mean(someList2)
    xStandDev = standDev(someList1)
    yStandDev = standDev(someList2)
    zList1 = []
    zList2 = []
    # Create 2 new lists taking (a[i]-a's Mean)/standard deviation of a
    for x in someList1:
        z1 = ((float(x)-xMean)/xStandDev)
        zList1.append(z1)
    for y in someList2:
        z2 = ((float(y)-yMean)/yStandDev)
        zList2.append(z2)
    # Mapping out the lists to be float values instead of string
    zList1 = list(map(float,zList1))
    zList2 = list(map(float,zList2))
    # Multiplying each value from the lists
    zFinal = [a*b for a,b in zip(zList1,zList2)]
    totalZ = 0
    # Taking the sum of all the products
    for a in zFinal:
        totalZ += a
    # Finally calculating correlation coefficient
    r = (1/(len(someList1) - 1)) * totalZ
    return r
SAMPLE RUN:
I have a list of [1,2,3,4,4,8] and [3,3,4,5,8,9]
I expect the correct answer of r = 0.8848, but get r = .203727
EDIT: To include the mean and standard deviation functions I have made.
def mean(someList):
    total = 0
    for a in someList:
        total += float(a)
    mean = total/len(someList)
    return mean
def standDev(someList):
    newList = []
    sdTotal = 0
    listMean = mean(someList)
    for a in someList:
        newNum = (float(a) - listMean)**2
        newList.append(newNum)
    for z in newList:
        sdTotal += float(z)
    standardDeviation = sdTotal/(len(newList))
    return standardDeviation
The Pearson correlation can be calculated with numpy's corrcoef.
import numpy
numpy.corrcoef(list1, list2)[0, 1]
Pearson Correlation Coefficient
Code (modified)
def mean(someList):
    total = 0
    for a in someList:
        total += float(a)
    mean = total/len(someList)
    return mean
def standDev(someList):
    listMean = mean(someList)
    dev = 0.0
    for i in range(len(someList)):
        dev += (someList[i]-listMean)**2
    dev = dev**(1/2.0)
    return dev
def correlCo(someList1, someList2):
    # First establish the means and standard deviations for both lists.
    xMean = mean(someList1)
    yMean = mean(someList2)
    xStandDev = standDev(someList1)
    yStandDev = standDev(someList2)
    # r numerator
    rNum = 0.0
    for i in range(len(someList1)):
        rNum += (someList1[i]-xMean)*(someList2[i]-yMean)
    # r denominator
    rDen = xStandDev * yStandDev
    r = rNum/rDen
    return r
print(correlCo([1,2,3,4,4,8], [3,3,4,5,8,9]))
Output
0.884782972876
Normally, according to the standard deviation formula, shouldn't you divide dev by the number of samples (the length of the list) before taking the square root?
I mean:
dev += ((someList[i]-listMean)**2)/len(someList)
Your standard deviation is wrong. You forgot to take the square root.
You are actually returning the variance, not the standard deviation, from that function.
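A minimal sketch of the z-score approach from the question with that fix applied, using only the standard library. One detail matters here and is an assumption about what the asker intended: because the final step divides by n - 1, the standard deviation must also use n - 1 (the sample form), otherwise the result is off by a factor of n / (n - 1). The function names are illustrative.

```python
import math

def mean(values):
    return sum(values) / len(values)

def sample_std_dev(values):
    m = mean(values)
    # Square root of the sample variance (n - 1 in the denominator).
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def correlation(xs, ys):
    x_mean, y_mean = mean(xs), mean(ys)
    x_sd, y_sd = sample_std_dev(xs), sample_std_dev(ys)
    # Sum of products of paired z-scores, divided by n - 1.
    z_products = sum(((x - x_mean) / x_sd) * ((y - y_mean) / y_sd)
                     for x, y in zip(xs, ys))
    return z_products / (len(xs) - 1)

r = correlation([1, 2, 3, 4, 4, 8], [3, 3, 4, 5, 8, 9])
print(round(r, 4))
```

For the sample lists in the question this agrees with the expected r = 0.8848.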

Error with user input for standard deviation program

My program is meant to calculate the standard deviation for 5 values given by the users. There is an issue with my code when getting the input in a for loop. Why is that?
givenValues = []
def average(values):
    for x in range(0, 6):
        total = total + values[x]
        if(x==5):
            average = total/x
    return average
def sqDiff(values):
    totalSqDiff = 0
    sqDiff = []
    av = average(values)
    for x in range(0,6):
        sqDiff[x] = (values[x] - av)**2
        totalSqDiff = totalSqDiff + sqDiff[x]
    avSqDiff = totalSqDiff / 5
    SqDiffSquared = avSqDiff**2
    return SqDiffSquared
for counter in range(0,6):
    givenValues[counter] = float(input("Please enter a value: "))
    counter = counter + 1
sqDiffSq = sqDiff(givenValues)
print("The standard deviation for the given values is: " + sqDiffSq)
There are several errors in your code, which you can find by reading the error messages it produces:
In the function average
Insert the line total = 0 at the start of the function; you are using total before assigning it.
List appending
Do not assign to a list index that does not exist yet, for example
sqDiff[x] = (values[x] - av)**2
You can do this with dicts, but not with lists! Since Python cannot be sure that the list indices will be assigned contiguously, assigning to an index of an empty list raises an IndexError. Use sqDiff.append(...) instead.
Do not concatenate strings with floats. I recommend reading PEP 498
(https://www.python.org/dev/peps/pep-0498/), which gives you an idea of how strings can be formatted in Python.
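Putting those three fixes together might look like the sketch below. It goes beyond the points above in two ways that are assumptions about the intent: the result takes a square root instead of squaring the averaged squared differences, and the population form (dividing by the number of values) is kept because the question divides by 5 for 5 values. The function names are illustrative.

```python
import math

def average(values):
    total = 0  # initialise before accumulating
    for value in values:
        total = total + value
    return total / len(values)

def standard_deviation(values):
    av = average(values)
    sq_diffs = []
    for value in values:
        sq_diffs.append((value - av) ** 2)  # append, not index assignment
    # Assumption: population form, then a square root rather than squaring.
    return math.sqrt(sum(sq_diffs) / len(values))

def read_values(count=5):
    """Prompt for `count` numbers; range(count) yields exactly that many."""
    values = []
    for _ in range(count):
        values.append(float(input("Please enter a value: ")))
    return values

# Fixed sample values keep the sketch non-interactive.
sd = standard_deviation([2.0, 4.0, 4.0, 4.0, 5.0])
print(f"The standard deviation for the given values is: {sd:.4f}")
```

Calling standard_deviation(read_values()) would restore the interactive behaviour. Note that range(0, 6) in the question asks for six values, not five.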
