Smoothing simulation data for a Chi-Square demonstration - python

I am trying to teach my students about the Chi-Square distribution while stuck here at home. I have made a video that should be mostly helpful; however, I have been having trouble producing a graph with the specific properties of the Chi-Square distribution. The shape is right, but there is a lot of noise. Since this is simulation data it will never be perfectly smooth, but this is a bit much.
I have been trying to smooth the data. I have gone as far as to round the data to the nearest tenth and perform a moving average (k = 3) in order to get a graph as presentable as this:
[Plot: Chi-Squared simulation, df = 3, sample size = 100, samples = 100000, rounded and smoothed]
[Plot: Chi-Squared simulation, df = 3, sample size = 100, samples = 100000, not rounded, smoothed]
A few things I have noticed while working on this problem. First, the spikes and dips seem to occur at predictable locations. Second, without the rounding, the graph alternates regularly between a spike and a dip. I think this may be due to some sort of binary precision problem. I have tried to account for this by switching to numpy for my operations and forcing the data to be float64, but this had no effect.
What I would like to know is either:
If this problem is caused by binary precision, how can I properly mitigate that?
If this cannot be solved in that way, is there a better smoothing operation I could use?
Thank you for the assistance. Code is below.
import random
import numpy as np

# redLim, greenLim, yellowLim (category cut-offs) and redBalls, greenBalls,
# yellowBalls, blueBalls (expected proportions) are defined earlier and omitted here.

# Draw n samples of size sampleSize and compute the Chi-Square statistic for each
chiSqrList = []
n = 100000
sampleSize = 100
j = 0
while j < n:
    redTotal = 0
    greenTotal = 0
    yellowTotal = 0
    blueTotal = 0
    i = 0
    while i < sampleSize:
        x = random.random()
        if x < redLim:
            redTotal += 1
        elif x < greenLim:
            greenTotal += 1
        elif x < yellowLim:
            yellowTotal += 1
        else:
            blueTotal += 1
        i += 1
    observedBalls = np.array([redTotal, greenTotal, yellowTotal, blueTotal], dtype=np.float64)
    expectedBalls = np.array([sampleSize*redBalls, sampleSize*greenBalls, sampleSize*yellowBalls, sampleSize*blueBalls], dtype=np.float64)
    chiSqr = np.sum(np.power((observedBalls - expectedBalls), 2) / expectedBalls)
    chiSqr = round(chiSqr, 1)
    chiSqrList.append(chiSqr)
    j += 1
# Make count data
avgSqrDist = []
count = []
i = 0
for value in chiSqrList:
    if len(avgSqrDist) == 0:
        avgSqrDist.append(value)
        count.append(1)
    elif avgSqrDist[i] != value:
        avgSqrDist.append(value)
        count.append(1)
        i += 1
    else:
        count[i] += 1

# Smooth curve with a moving average (k = 3)
i = 0
smoothAvgSqrDist = []
smoothCount = []
while i < len(avgSqrDist) - 2:
    smoothCount.append((count[i] + count[i+1] + count[i+2]) / 3)
    smoothAvgSqrDist.append(avgSqrDist[i+1])
    i += 1
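For reference, the binning and moving-average step can also be written with NumPy's histogram and convolution helpers. This is only a sketch of the same idea, assuming chiSqrList has been filled as above; the 0.1 bin width is an assumption chosen to mirror the rounding to the nearest tenth.

# Sketch only: bin the statistics into fixed-width bins, then smooth with k = 3.
binWidth = 0.1  # assumed value, mirrors rounding to the nearest tenth
bins = np.arange(0, max(chiSqrList) + binWidth, binWidth)
counts, edges = np.histogram(chiSqrList, bins=bins)
centres = (edges[:-1] + edges[1:]) / 2
smoothCounts = np.convolve(counts, np.ones(3) / 3, mode='valid')
smoothCentres = centres[1:-1]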

Related

The area & center of gravity of a polygon having non-uniform density of vertices? (in Python)

I would like to calculate the COG of a polygon shaped exactly like the contour map of my town. However, using the available database of border points would produce a skewed result, since some places have a much higher density of border points than others, so the center of gravity would be pulled towards those regions. I tried to equalise the density of vertices with this Python code:
import numpy as np

punkty = open("borderpoints.txt", "r", encoding="utf8")
tempp = []
a = []
for line in punkty:
    for c in line:
        if c != " ":
            tempp.append(c)
        else:
            p = "".join(tempp)
            a.append(p)
            tempp = []

i = 0
x = []
y = []
fx = open("outx1.txt", "w")
fy = open("outy1.txt", "w")
while i < len(a) - 1:
    x.append(a[i])
    fx.write(a[i])
    fx.write("\n")
    y.append(a[i+1])
    fy.write(a[i+1])
    fy.write("\n")
    i = i + 2

j = 0
jump = 20
newxs = []
newys = []
fnx = open("newxs.txt", "w")
fny = open("newys.txt", "w")
while j < len(x):
    L = np.sqrt(pow((float(y[j+1]) - float(y[j])), 2) + pow((float(x[j+1]) - float(x[j])), 2))
    n = jump * L
    interval = (float(y[j+1]) - float(y[j])) / n
    k = 1
    slope = (float(x[j+1]) - float(x[j])) / (float(y[j+1]) - float(y[j]))
    inters = float(x[j+1]) - slope * float(y[j+1])
    while k < n + 1:
        g = float(y[j]) + k * interval
        newxs.append(g)
        fnx.write(str(g))
        fnx.write("\n")
        g = (slope * (float(y[j]) + k * interval) + inters)
        newys.append(g)
        fny.write(str(g))
        fny.write("\n")
        k = k + 1
    j = j + 2
    k = 1
newxs.append(x)
newys.append(y)
However, in the result the points ended up denser everywhere except the places that were previously empty and were supposed to be filled in by the algorithm. I have maps of the border points before and after applying the algorithm (some proportions may vary, but the main problem is the empty spot).
What approach could I use to solve this problem? How can I make the points equally distributed, or is it possible to calculate the COG by some other method?
My aim is that the number of points should not determine the COG; the points should only determine the positions of the polygon's sides. Those sides are what matter here, but obviously there is no database for them, and it is harder to calculate the COG from a collection of linear functions and their ranges.
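One standard alternative, for reference, is the area-weighted centroid of the polygon (the shoelace formula), which depends only on the ordered boundary vertices and not on how densely they are sampled. A minimal sketch, assuming xs and ys are hypothetical arrays holding the border coordinates in order around the boundary:

import numpy as np

def polygon_centroid(xs, ys):
    # Area-weighted centroid via the shoelace formula.
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    x_next = np.roll(xs, -1)
    y_next = np.roll(ys, -1)
    cross = xs * y_next - x_next * ys      # signed parallelogram areas per edge
    area = cross.sum() / 2.0               # signed polygon area
    cx = ((xs + x_next) * cross).sum() / (6.0 * area)
    cy = ((ys + y_next) * cross).sum() / (6.0 * area)
    return cx, cy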

For loop with while statement sticking on certain iterations

I have code which simulates a Markov jump process.
I am trying to increase the number of realisations, Rr, whilst reducing the computation time. The code runs for some values and then randomly gets stuck, and I don't know exactly why. I think it may have to do with my while loop, although I am unsure how to get around this. If anyone could provide some insight, that would be great.
import numpy as np
import random
import matplotlib.pyplot as plt
import csv
import time

t = time.time()
# Initial jump value
n0 = 1
# Constant
cbar = 1
# Constant
L = 1
# Constant
tau = 2*L/cbar
# Empty list to store N series
N_array = []
# Empty list to store z series
z_array = []
# Number of initial realisations
Rr = 2000
for r in range(0, Rr):
    # Set initial z series values to be zero
    z = [0]
    # Set initial jump process values to be n0
    N = [n0]
    # Set iteration to be zero
    j = 0
    # While the value of the N[j] series (realisation r) is greater than zero and j is less than jmax
    while N[j] > 0:
        # Form next value in z series according to an exponential clock, parameter n^2, since Lloc = 2, cbar = 1, L = 1
        # In Python the argument of this function is beta = 1/lambda, hence 1/n^2
        z.append(z[j] + np.random.exponential(1/N[j]**2))
        # Pick jump at position j+1 to be N[j] -1 or +1 with prob 1/2
        N.append(N[j] + np.random.choice([-1, 1]))
        # Update iteration
        j = j + 1
    # Store N, z realisation if sum dz < 10, say
    if sum(np.diff(z)) < 10:
        N_array.append(N)
        z_array.append(z)
    print(r/Rr*100, '%')

elapsed = time.time() - t
print('Elapsed time =', elapsed, 'seconds')
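One detail worth noting: the comment above the while loop mentions a cap jmax, but the loop condition never enforces it, so a realisation whose N series wanders for a long time before hitting zero can run for a very long time. A minimal sketch of what such a guard might look like (jmax is a hypothetical value, not part of the original code):

# Sketch only: cap the number of jumps per realisation, as the comment suggests.
jmax = 10000  # hypothetical cap, not from the original code
while N[j] > 0 and j < jmax:
    z.append(z[j] + np.random.exponential(1/N[j]**2))
    N.append(N[j] + np.random.choice([-1, 1]))
    j = j + 1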

TSP, algorithm gets stuck in local minimum

I am struggling to implement a program based on simulated annealing to solve the travelling salesman problem. None of the solutions I get are satisfying, and I have no clue how to improve my implementation. I'm not focusing on benchmarks, only on finding a visually acceptable shortest path. If anyone could enlighten me, I would be thankful.
import math
import random
import numpy as np
import matplotlib.pyplot as plt

# weight function, simple euclidean norm
def road(X, Y):
    sum = 0
    size = len(X) - 1
    for i in range(0, size):
        sum += math.sqrt((X[i]-X[i+1])**2 + (Y[i]-Y[i+1])**2)
    return sum

def array_swap(X, Y, index_1, index_2):
    X[index_1], X[index_2] = X[index_2], X[index_1]
    Y[index_1], Y[index_2] = Y[index_2], Y[index_1]

def arbitrarty_swap(X, Y):
    ran = len(X) - 1
    pick_1 = random.randint(0, ran)
    pick_2 = random.randint(0, ran)
    X[pick_1], X[pick_2] = X[pick_2], X[pick_1]
    Y[pick_1], Y[pick_2] = Y[pick_2], Y[pick_1]
    return pick_1, pick_2

N = 40
X = np.random.rand(N) * 100
Y = np.random.rand(N) * 100
plt.plot(X, Y, '-o')
plt.show()

best = road(X, Y)
X1 = X.copy()
Y1 = Y.copy()
# history of the system's energy
best_hist = []
iterations = 100000
T = 1.02
B = 0.999
for i in range(0, iterations):
    index_1, index_2 = arbitrarty_swap(X, Y)
    curr = road(X, Y)
    diff = (curr - best)
    if diff < 0:
        best = curr
        best_hist.append(best)
        array_swap(X1, Y1, index_1, index_2)
    elif math.exp(-(diff)/T) > random.uniform(0, 1):
        best_hist.append(curr)
        T *= B
    else:
        array_swap(X, Y, index_1, index_2)
https://i.stack.imgur.com/A6hmd.png
I didn't run your code, but one thing I'd try is changing the SA implementation.
Currently, you have 100,000 iterations in one loop. I would break that into two: the outer loop controls the temperature, and the inner loop performs a number of moves at that temperature. Something like this (pseudo code):
t = 0; iterations = 1000; repeat = 1000
while t <= repeat:
    n = 0
    while n <= iterations:
        # your SA implementation.
        n += 1  # increase your iteration count in each temperature
    # in outer while,
    t += 1
    T *= B
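A minimal runnable sketch of that two-loop structure, reusing road, arbitrarty_swap and array_swap from the question; the schedule constants (T, B, outer_steps, inner_steps) are assumptions, not tuned values:

# Sketch only: outer loop cools the temperature, inner loop proposes moves at that temperature.
T = 1.0                 # assumed starting temperature
B = 0.9                 # assumed cooling factor
outer_steps = 100       # assumed number of temperature levels
inner_steps = 1000      # assumed moves per temperature
energy = road(X, Y)
for t_step in range(outer_steps):
    for n in range(inner_steps):
        i1, i2 = arbitrarty_swap(X, Y)   # propose a move by swapping two cities
        curr = road(X, Y)
        diff = curr - energy
        if diff < 0 or math.exp(-diff / T) > random.uniform(0, 1):
            energy = curr                # accept the move
        else:
            array_swap(X, Y, i1, i2)     # reject: undo the swap
    T *= B                               # cool once per temperature level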

Any way to speed up this exponentially expensive Python code?

I know I can use Cython for a quick and dirty improvement, but before that, is there any way to speed up the code in a Pythonic way?
This code generates polynomial features on a Hermite basis and can be extended to any dimension; the aim is generalized feature generation covering all possible polynomial combinations.
import numpy as np
from numpy import polyval               # import assumed; not shown in the original snippet
from scipy.special import hermitenorm   # import assumed; not shown in the original snippet
from itertools import product

data_row = 4000
n_components = 14
q = n_components
degree = 1

x = np.random.rand(data_row, n_components)
feature_list = []
feature_array = np.zeros((data_row, (degree + 1)**q))

num = 0
for feature_combination in product(xrange(degree + 1), repeat=q):
    # iterate over all feature combinations
    single_combination_feature = 1
    for i_component, current_hermite_degree in enumerate(feature_combination):
        single_combination_feature *= polyval(hermitenorm(current_hermite_degree), x[:, i_component])
    feature_array[:, num] = single_combination_feature
    num += 1
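A sketch of one possible Pythonic speed-up (an untested assumption, not the poster's code): evaluate each normalized Hermite polynomial once per column up front, then reuse the cached values inside the combination loop instead of calling polyval for every one of the (degree + 1)**q combinations.

# Sketch only: cache polyval(hermitenorm(d), x[:, c]) for every degree d and column c.
hermite_vals = np.empty((degree + 1, data_row, n_components))
for d in range(degree + 1):
    for c in range(n_components):
        hermite_vals[d, :, c] = polyval(hermitenorm(d), x[:, c])

col_idx = np.arange(q)
for num, feature_combination in enumerate(product(range(degree + 1), repeat=q)):
    deg_idx = np.array(feature_combination)              # degree chosen per component
    feature_array[:, num] = hermite_vals[deg_idx, :, col_idx].prod(axis=0)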

Numpy speed up nested loop with fancy indexing

I have implemented an algorithm which calculates a quality assessment of disparity maps based on total variation.
I'm relatively new to Python, but I have already read numerous threads on speeding up NumPy code: views vs. fancy indexing, trying Cython, vectorising nested loops, etc. I achieved some speed-ups, but altogether I ended up with messier and messier code without a proper speed-up.
I wonder if someone can give me a hint whether there is a clean and easy way to speed up this 2D loop.
TV is a 2D array with roughly 15k x 15k elements.
footprint_ix and footprint_iy are two lists of arrays which contain the index offsets to the neighbour pixels of pixel (x, y) in a ring-shaped manner. With m = 1 the 8 neighbouring pixels are selected, with m = 2 the next 16, and so on.
The algorithm sums up the neighbour pixels of (x, y) and increases m as long as a threshold TAU is not exceeded.
The best solution I have come up with so far uses row-wise multiprocessing.
m_classes = 21

# create footprints
footprint_ix = []
footprint_iy = []
for m in range(1, m_classes):
    fp = np.ones((2 * m + 1, 2 * m + 1), dtype=int)
    fp[1:-1, 1:-1] = 0
    i, j = np.nonzero(fp)
    i = i - m
    j = j - m
    footprint_ix.append(i)
    footprint_iy.append(j)

# tv, disp, tv_classes, rows, cols and TAU are defined elsewhere
for x in xrange(0, rows):
    for y in xrange(0, cols):
        if disp[x, y] == np.inf:
            continue
        else:
            tv_m = 0
            for m_i in range(0, m_classes - 1):
                m = m_i + 1
                try:
                    tv_m += np.sum(tv[footprint_ix[m_i] + x, footprint_iy[m_i] + y]) / (8 * m)
                except IndexError:
                    tv_m = np.inf
                if tv_m >= TAU:
                    tv_classes[x, y] = m
                    break
                if m == m_classes - 1:
                    tv_classes[x, y] = m
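For comparison, a sketch of a possible vectorised variant using scipy.ndimage (an assumption, not verified against the multiprocessing version): each ring sum is computed for the whole image with a correlation, and the running total is checked per pixel. Note the border handling differs from the original, which sets tv_m to inf on an IndexError, whereas this sketch zero-pads.

import numpy as np
from scipy import ndimage

def classify_tv(tv, disp, TAU, m_classes=21):
    # Sketch only: ring sums via whole-image correlation instead of per-pixel fancy indexing.
    rows, cols = tv.shape
    running = np.zeros((rows, cols))
    tv_classes = np.full((rows, cols), m_classes - 1)
    done = ~np.isfinite(disp)                    # skip pixels where disp == inf
    for m in range(1, m_classes):
        fp = np.ones((2 * m + 1, 2 * m + 1), dtype=float)
        fp[1:-1, 1:-1] = 0                       # ring-shaped footprint of width 1
        ring_sum = ndimage.correlate(tv, fp, mode='constant', cval=0.0)
        running += ring_sum / (8 * m)
        hit = (~done) & (running >= TAU)
        tv_classes[hit] = m
        done |= hit
    return tv_classes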
