What went wrong with my Kruskal-Wallis class?

I was trying to build a class that could perform the Kruskal-Wallis test. The class uses the following formula to compute H:
However, it yields a different H-value than the kruskal function of scipy. Does anyone know why this is the case?
import numpy as np
from scipy.stats import rankdata
from scipy.stats import kruskal
class Kruskal_Wallis():
    def __init__(self):
        pass
    def fit(self, groups):
        """
        Performs Kruskal-Wallis test.
        :param groups: list containing 1D group arrays
        Adds the following attributes:
        - n: size of total population
        - n_groups: number of groups (n_groups = len(n_i) = len(r_i))
        - n_i: array containing group sizes
        - df: degrees of freedom
        - r2_i: array containing the square of the sum of ranks for each group
        - h: kruskal-wallis statistic
        """
        def sum_ranks_per_group(groups):
            n_groups = len(groups)
            n_i = np.array([group.shape[0] for group in groups])
            data = np.array([])
            for group in groups:
                data = np.concatenate((data, group), axis=0)
            ranked_data = rankdata(data, method="average")
            ranked_groups = ranked_data.reshape((n_groups, n_i[0])) #works only if groups have equal size
            summed_ranks = ranked_groups.sum(axis=1)
            return summed_ranks
        def get_h(n, r2_i, n_i):
            summed_r2_i_per_n_i = (r2_i/n_i).sum()
            h = (12/(n*(n-1)) * summed_r2_i_per_n_i) - 3*(n+1)
            return h
        n_groups = len(groups)
        n_i = np.array([group.shape[0] for group in groups])
        n = sum(n_i)
        df = n_groups - 1
        r2_i = sum_ranks_per_group(groups)**2
        h = get_h(n, r2_i, n_i)
        self.n_groups = n_groups
        self.n_i = n_i
        self.n = n
        self.df = df
        self.r2_i = r2_i
        self.h = h
## Compare results yielded by scipy.stats.kruskal and Kruskal_Wallis class
groups = [np.arange(1,3),
          np.arange(3,5)]
res = kruskal(groups[0], groups[1])
kruskal_wallis = Kruskal_Wallis()
kruskal_wallis.fit(groups)
print(res)
print(kruskal_wallis.h)

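The difference between the answers might be caused by the way Python handles the float type in division operations.
Instead of using Python's division operator (/), try NumPy's true division (np.true_divide). Note that this can only matter under Python 2, where / between integers truncates; in Python 3, / is already true division.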
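
A separate point worth checking, offered as a sketch rather than a confirmed diagnosis: the textbook Kruskal-Wallis statistic without tie correction is H = 12/(n(n+1)) * sum(R_i^2/n_i) - 3(n+1), whereas get_h in the question divides by n*(n-1). The helper name kruskal_h below is hypothetical; it uses n*(n+1) and compares the result with scipy.stats.kruskal on the question's tie-free data.

```python
import numpy as np
from scipy.stats import rankdata, kruskal

def kruskal_h(groups):
    """Kruskal-Wallis H without tie correction (hypothetical helper)."""
    n_i = np.array([len(g) for g in groups])
    n = n_i.sum()
    # rank the pooled sample, then split the ranks back per group
    ranks = rankdata(np.concatenate(groups), method="average")
    pieces = np.split(ranks, np.cumsum(n_i)[:-1])  # also works for unequal sizes
    rank_sums = np.array([piece.sum() for piece in pieces])
    # note the n * (n + 1) in the denominator
    return 12.0 / (n * (n + 1)) * np.sum(rank_sums**2 / n_i) - 3 * (n + 1)

groups = [np.arange(1, 3), np.arange(3, 5)]
h_manual = kruskal_h(groups)
h_scipy = kruskal(*groups).statistic
print(h_manual, h_scipy)
```

scipy additionally applies a tie correction, so the two values are only expected to agree this closely when the pooled data contain no ties.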

Related

Weird results obtained while solving a set of coupled differential equations (using a sparse array) in python

I have spent a week trying, to no avail, to solve a system of coupled differential equations and reproduce the results shown in the attached image; instead I get the weird results that are also shown. I don't know what I might be doing wrong. The set of coupled differential equations was originally solved using Newman's BAND; there is a Python implementation of that solution using BAND, and a clearer copy of the image of the problem is linked in case the attached one is not clear enough. What I am now trying to do is solve the same problem by creating a sparse array directly from the discretized equations, using a combination of sympy and numpy, and then solving it with scipy's spsolve. My code is below. I need some help figuring out what I am doing wrong.
I have represented the variables as c1 = cA, c2 = cB, c3 = cC, c4 = cD in my code. Equation 2 has been linearized, and phi10 and phi20 are the trial values of the variables cC and cD.
# import modules
import numpy as np
import sympy
from sympy.core.function import _mexpand
import scipy as sp
import scipy.sparse as ss
import scipy.sparse.linalg as ssl
import matplotlib.pyplot as plt
# define functions
def flatten(t):
    """
    function to flatten lists
    """
    return [item for sublist in t for item in sublist]
def get_coeffs(coeff_dict, func_vars):
    """
    function to extract coefficients from variables
    and form the sparse symbolic array
    """
    c = coeff_dict
    for i in list(c.keys()):
        b, _ = i.as_base_exp()
        if b == i:
            continue
        if b in c:
            c[i] = 0
        if any(k.has(b) for k in c):
            c[i] = 0
    return [coeff_dict[val] for val in func_vars]
# Constants for the problem
I = 0.1 # A/cm2
L = 1.0 # distance (x) in cm
m = 100 # grid spacing
h = L / (m-1)
a = 23300 # 1/cm
io = 2e-7 # A/cm2
n = 1
F = 96500 # C/mol
R = 8.314 # J/mol-K
T = 298 # K
sigma = 20 # S/cm
kappa = 0.06 # S/cm
alpha = 0.5
beta = -(1-alpha)*n*F/R/T
phi10 , phi20 = 5, 0.5 # these are just guesses
P = a*io*np.exp(beta*(phi10-phi20))
j = sympy.symbols('j',integer = True)
cA = sympy.IndexedBase('cA')
cB = sympy.IndexedBase('cB')
cC = sympy.IndexedBase('cC')
cD = sympy.IndexedBase('cD')
# write the boundary conditions at x = 0
bc=[cA[1], cB[1],
    (4/3) * cC[2] - (1/3)*cC[3], # use a three point approximation for cC_prime
    cD[1]
    ]
# form a list of expressions from the boundary conditions and equations
expr=flatten([bc,flatten([[
    -cA[j-1] - cB[j-1] + cA[j+1] + cB[j+1],
    cB[j-1] - 2*h*P*beta*cC[j] + 2*h*P*beta*cD[j] - cB[j+1],
    -sigma*cC[j-1] + 2*h*cA[j] + sigma * cC[j+1],
    -kappa * cD[j-1] + 2*h * cB[j] + kappa * cD[j+1]] for j in range(2, m)])])
vars = [cA[j], cB[j], cC[j], cD[j]]
# flatten the list of variables
unknowns = flatten([[cA[j], cB[j], cC[j], cD[j]] for j in range(1,m)])
var_len = len(unknowns)
# # # substitute in the boundary conditions at x = L while getting the coefficients
A = sympy.SparseMatrix([get_coeffs(_mexpand(i.subs({cA[m]:I}))\
    .as_coefficients_dict(), unknowns) for i in expr])
# convert to a numpy array
mat_temp = np.array(A).astype(np.float64)
# you can view the sparse array with this
fig = plt.figure(figsize=(6,6))
ax = fig.add_axes([0,0, 1,1])
cmap = plt.cm.binary
plt.spy(mat_temp, cmap = cmap, alpha = 0.8)
def solve_sparse(b0, error):
    # create the b column vector
    b = np.copy(b0)
    b[0:4] = np.array([0.0, I, 0.0, 0.0])
    b[var_len-4] = I
    b[var_len-3] = 0
    b[var_len-2] = 0
    b[var_len-1] = 0
    print(b.shape)
    old = np.copy(b0)
    mat = np.copy(mat_temp)
    b_2 = np.copy(b)
    resid = 10
    lss = 0
    while lss < 100:
        mat_2 = np.copy(mat)
        for j in range(3, var_len - 3, 4):
            # update the forcing term of equation 2
            b_2[j+2] = 2*h*(1-beta*old[j+3]+beta*old[j+4])*a*io*np.exp(beta*(old[j+3]-old[j+4]))
            # update the sparse array at every iteration for variables cC and cD in equation2
            mat_2[j+2, j+3] += 2*h*beta*a*io*np.exp(beta*(old[j+3]-old[j+4]))
            mat_2[j+2, j+4] += 2*h*beta*a*io*np.exp(beta*(old[j+3]-old[j+4]))
        # form the column sparse matrix
        A_s = ss.csc_matrix(mat_2)
        new = ssl.spsolve(A_s, b_2).flatten()
        resid = np.sum((new - old)**2)/var_len
        lss += 1
        old = np.copy(new)
    return new
val0 = np.array([[0.0, 0.0, 0.0, 0.0] for _ in range(m-1)]).flatten() # form an array of initial values
error = 1e-7
## Run the code
conc = solve_sparse(val0, error).reshape(m-1, len(vars))
conc.shape # gives (99, 4)
# Plot result for cA:
plt.plot(conc[:,0], marker = 'o', linestyle = '')

What happens seems pretty clear now, after having seen that the plotted solution indeed oscillates between the upper and lower values. You are using the central Euler method as discretization, for u'=F(u) this reads as
u[j+1]-u[j-1] = 2*h*F(u[j])
This method is only weakly stable and allows the sub-sequences of odd and even indices to evolve rather independently. As an equation, this means that the solution may approximate the system ue'=F(uo), uo'=F(ue), with independent functions ue and uo that follow the paths of the even and odd sub-sequences.
These even and odd parts are tied together only by the treatment of the boundary points, two or three points deep. So avoiding or reducing the oscillation requires very careful handling of the boundary conditions and of the differential equations at the boundary points.
But one can avoid all this unpleasantness by using the trapezoidal method
u[j+1]-u[j] = 0.5*h*(F(u[j+1])+F(u[j]))
This also reduces the band-width of the system matrix.
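
The contrast between the two schemes can be seen on a much simpler problem than the one in the question. The following is a minimal sketch, assuming the linear test equation u' = -u with u(0) = 1 (not the coupled system above); the step size and number of steps are arbitrary illustrative choices.

```python
import numpy as np

h, n_steps = 0.1, 200              # integrate u' = lam*u up to t = 20
lam = -1.0                         # F(u) = lam * u, a linear test problem

# central two-step scheme: u[j+1] - u[j-1] = 2*h*F(u[j])
central = np.empty(n_steps + 1)
central[0] = 1.0
central[1] = np.exp(lam * h)       # exact value used for the extra starting point
for j in range(1, n_steps):
    central[j + 1] = central[j - 1] + 2 * h * lam * central[j]

# trapezoidal scheme: u[j+1] - u[j] = 0.5*h*(F(u[j+1]) + F(u[j])), solved for u[j+1]
trap = np.empty(n_steps + 1)
trap[0] = 1.0
for j in range(n_steps):
    trap[j + 1] = trap[j] * (1 + 0.5 * h * lam) / (1 - 0.5 * h * lam)

exact = np.exp(lam * h * n_steps)
print("last central values:", central[-3:])
print("trapezoidal error:  ", abs(trap[-1] - exact))
```

In this sketch the last central values alternate in sign and are large, because the even and odd sub-sequences drift apart, while the trapezoidal values keep decaying smoothly.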
To implement the implied Newton method correctly (linearizing via Taylor and solving the linearized equation is what the Newton-Kantorovich method does), you need to replace F(u[j]) with F(u_old[j])+F'(u_old[j])*(u[j]-u_old[j]). This gives a linear system of equations in u for each iteration step.
For the trapezoidal method this gives
(I-0.5*h*F'(u_old[j+1]))*u[j+1] - (I+0.5*h*F'(u_old[j]))*u[j]
= 0.5*h*(F(u_old[j+1])-F'(u_old[j+1])*u_old[j+1] + F(u_old[j])-F'(u_old[j])*u_old[j])
In general, the derivative values, and thus the system matrix, need not be updated at every step; only the function values must be (otherwise the iteration does not move forward).
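
The linearized trapezoidal system can be assembled row by row. The following is a minimal sketch, assuming a scalar model problem u' = -u**2 with u(0) = 1 (exact solution 1/(1+t)) in place of the coupled system from the question; the function name newton_trapezoid, the dense matrix and the stopping tolerance are illustrative assumptions.

```python
import numpy as np

def F(u):
    return -u**2

def dF(u):
    return -2.0 * u

def newton_trapezoid(u0, h, n_steps, n_iter=50, tol=1e-12):
    """Solve all grid values at once, relinearizing around u_old each pass."""
    u_old = np.full(n_steps + 1, u0, dtype=float)   # initial guess: constant
    for _ in range(n_iter):
        A = np.zeros((n_steps + 1, n_steps + 1))
        b = np.zeros(n_steps + 1)
        A[0, 0], b[0] = 1.0, u0                     # boundary condition u[0] = u0
        for j in range(n_steps):
            # (1 - 0.5*h*F'(u_old[j+1]))*u[j+1] - (1 + 0.5*h*F'(u_old[j]))*u[j] = rhs
            A[j + 1, j + 1] = 1.0 - 0.5 * h * dF(u_old[j + 1])
            A[j + 1, j] = -(1.0 + 0.5 * h * dF(u_old[j]))
            b[j + 1] = 0.5 * h * (F(u_old[j + 1]) - dF(u_old[j + 1]) * u_old[j + 1]
                                  + F(u_old[j]) - dF(u_old[j]) * u_old[j])
        u_new = np.linalg.solve(A, b)
        converged = np.max(np.abs(u_new - u_old)) < tol
        u_old = u_new
        if converged:
            break
    return u_old

u = newton_trapezoid(u0=1.0, h=0.01, n_steps=100)
print(u[-1], 1.0 / (1.0 + 1.0))   # numerical vs exact value at t = 1
```

For a real problem the matrix is banded, so a sparse or banded solver would replace the dense np.linalg.solve used here for brevity.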

Minimise a multivariate function using scipy.optimize.fmin_bfgs

I am trying to minimize this multivariate
where αi are constants (could be both positive or negative) and n is fixed,
using the scipy.optimize.fmin_bfgs function.
Conditions:
Test the code for a random natural number n between 5 and 10
A random starting point (all of the form m.dddd)
Do the iterations till the successive iterates are less than 2% in
absolute value, in the l∞ norm.
The coefficients αi (of the form m.dddd) should be chosen randomly so
that at least 40% of them are negative and at least 25% of them are
positive.
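
The sign condition can be enforced directly instead of by weighted sampling. The following is a minimal sketch, assuming that "at least 40% negative and at least 25% positive" refers to counts among the n coefficients and that m.dddd means one digit before and four after the decimal point; the helper random_alpha is hypothetical.

```python
import math
import numpy as np

def random_alpha(n, rng, min_neg=0.40, min_pos=0.25):
    """Coefficients of the form m.dddd with guaranteed sign fractions."""
    n_neg = math.ceil(min_neg * n)          # forced negative entries
    n_pos = math.ceil(min_pos * n)          # forced positive entries
    n_free = n - n_neg - n_pos              # remaining entries get a random sign
    magnitudes = np.round(rng.uniform(0.0001, 9.9999, size=n), 4)
    signs = np.concatenate([-np.ones(n_neg),
                            np.ones(n_pos),
                            rng.choice([-1.0, 1.0], size=n_free)])
    rng.shuffle(signs)                      # randomize which positions are negative
    return signs * magnitudes

rng = np.random.default_rng()
n = int(rng.integers(5, 11))                # upper bound is exclusive, so 5..10
alpha = random_alpha(n, rng)
x0 = np.round(rng.uniform(-9.9999, 9.9999, size=n), 4)   # random start, m.dddd
print(n, alpha, x0)
```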
This is what I have tried (for the custom callback I referred to https://stackoverflow.com/a/30365576/7906671):
import numpy as np
from scipy.optimize import minimize
from scipy.optimize import fmin_bfgs
#Generate a list of random positive and negative integers
random_list = np.random.uniform(-1, 1, size=(1, 10))[0].tolist()
p = []
n, npr = [], []
for r in range(len(random_list)):
    if random_list[r] < 0:
        n.append(random_list[r])
        npr.append((str(random_list[r]), 0.4))
    else:
        p.append(random_list[r])
        npr.append((str(random_list[r]), 0.25))
#Function to pick negative number with 40% probability and positive numbers with 25% probability
def w_choice(seq):
    total_prob = sum(item[1] for item in seq)
    chosen = np.random.uniform(0, total_prob)
    cumulative = 0
    for item, probality in seq:
        cumulative += probality
        if cumulative > chosen:
            return item
#Random start value with m.dddd and size of the input array is between 5 and 10
n = np.random.randint(5, 10)
x0 = np.round(np.random.randn(n,1), 4)
alpha = []
for i in range(n):
    alpha.append(np.round(float(w_choice(npr)), 4))
print("alpha: ", alpha)
def func(x):
    return sum(alpha*(x**2.0))
class StopOptimizingException(Exception):
    pass
class CallbackCollector:
    def __init__(self, f, thresh):
        self._f = f
        self._thresh = thresh
    def __call__(self, xk):
        if self._f(xk) < self._thresh:
            self.x_opt = xk
cb = CallbackCollector(func, thresh=0.02)
x, _, _ = fmin_bfgs(func, x0, callback=cb)
But this does not converge and gives the following warning:
Warning: Desired error not necessarily achieved due to precision loss.
I am not able to figure out why this fails. Any help is appreciated!
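
One possible explanation, offered as an assumption rather than a confirmed diagnosis: judging from func in the code, the objective is the sum of alpha_i * x_i**2, and with any negative alpha_i that function is unbounded below, so an unconstrained minimizer has no finite minimum to converge to. The sketch below uses hypothetical coefficients and walks along the coordinate with the most negative coefficient.

```python
import numpy as np

alpha = np.array([1.5, -0.7, 2.0, -1.2, 0.3])   # hypothetical coefficients

def func(x):
    return np.sum(alpha * x**2)

k = int(np.argmin(alpha))            # coordinate with the most negative alpha
values = []
for t in (1.0, 10.0, 100.0, 1000.0):
    x = np.zeros_like(alpha)
    x[k] = t                         # move only along coordinate k
    values.append(func(x))           # equals alpha[k] * t**2
print(values)
```

The printed values keep decreasing as t grows, which is consistent with the line search running until precision is lost.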

Faster exhaustive search using numpy

I'm trying to maximize the minimum of two functions using exhaustive search. This solution works, but the loops in Python consume a lot of computing time. Is there an efficient way to use numpy (meshgrid or vectorize) to solve this problem?
Code:
The functions below are used in the exhaustive search method:
import numpy as np
def F1(x):
    return (x/11)**10
def F2(x,y,z):
    return z+x/y
def F3(x,y,z,a,b,c):
    return ((x+y)**z)/((a-b)**c)
The exhaustive search method takes 6 parameters (scalars or 1D arrays). For the moment I just want to run my code on scalars; I can then use another function to loop over those parameters if they are 1D arrays.
def B_F(P1, P2, P3,P4, P5, P6) :
    # initializing my optimal parameters
    a_Opt, b_opt, c_opt, obj_opt = 0, 0, 0, 0
    # feasible set
    a = np.linspace(0.0,1.0,10)
    b = np.linspace(0.0,100.0,100)
    c = np.linspace(0.0,100.0,100)
    for i in a:
        for j in b:
            for k in c:
                #if constraint is respected
                if P1*k+P2*j+2*(i*k*j) <= F1(P3):
                    # calculate the max min of the two function
                    f_1 = F2(i,k,P6)
                    f_2 = F3(i,k,j,10,P4,P5)
                    min_f = np.minimum(f_1, f_2)
                    # extract optimal parameters and objective function
                    if obj_opt <= min_f :
                        a_Opt = i
                        b_opt = j
                        c_opt = k
                        obj_opt = min_f
    exhaustive_research = np.array([[obj_opt, a_Opt, b_opt, c_opt]])
    return exhaustive_research

You can do it this way:
A,B,C = np.meshgrid(a,b,c)
mask = P1*C+P2*B+2*(A*B*C) <= F1(P3)
A = A[mask]
B = B[mask]
C = C[mask]
f_1 = F2(A,C,P6)
f_2 = F3(A,C,B,10,P4,P5)
min_f = np.minimum(f_1, f_2)
ind = np.argmax(min_f)
obj_opt, a_Opt, b_opt, c_opt = min_f[ind], A[ind], B[ind], C[ind]
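
The snippet above can be wrapped into a function with the same return shape as B_F. This is a minimal sketch: the name b_f_vectorized, passing the grids a, b, c as arguments, the indexing="ij" choice and the guard for an empty feasible set are assumptions added for illustration. Ties may resolve to a different grid point than in the loop version, and grids containing 0 can produce inf or nan in F2, which the sketch does not handle.

```python
import numpy as np

def F1(x):
    return (x / 11) ** 10

def F2(x, y, z):
    return z + x / y

def F3(x, y, z, a, b, c):
    return ((x + y) ** z) / ((a - b) ** c)

def b_f_vectorized(P1, P2, P3, P4, P5, P6, a, b, c):
    # build every (a, b, c) combination at once
    A, B, C = np.meshgrid(a, b, c, indexing="ij")
    mask = P1 * C + P2 * B + 2 * (A * B * C) <= F1(P3)   # feasibility constraint
    if not mask.any():
        return None                                      # no feasible grid point
    A, B, C = A[mask], B[mask], C[mask]                  # keep feasible points only
    min_f = np.minimum(F2(A, C, P6), F3(A, C, B, 10, P4, P5))
    ind = np.argmax(min_f)                               # first maximizer of the minimum
    return np.array([[min_f[ind], A[ind], B[ind], C[ind]]])

# hypothetical inputs; the grids avoid 0 so that x / y stays finite
a = np.linspace(0.1, 1.0, 5)
b = np.linspace(0.1, 2.0, 5)
c = np.linspace(0.1, 2.0, 5)
result = b_f_vectorized(1.0, 1.0, 11.0, 2.0, 1.0, 0.5, a, b, c)
print(result)
```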

Crank Nicolson Method on Wave Function Python

I am trying to propagate a Gaussian wave packet using the Crank-Nicolson method in imaginary time (multiplying the time step by the imaginary unit). The code I have written in an attempt to achieve this is shown here:
import matplotlib.pyplot as plt #this allows you to plot, and changes the name to plt
import numpy as np #this allows you to do math, and changes the name to np
import math
import scipy.linalg as la
def V(x):
    # k = 1
    # v = k*x**4
    v = 0.25*(x-3)**2+0.15*(x-3)**4
    return v
def Psi(x):
    psi = np.exp(-2*(x-3)**2)
    return psi
#Function for computing integral using trapezoid method
def TrapInt(y, h):
    trap = [(float(y[ii]) + float(y[ii+1])) for ii in range(0, len(y)-1)]
    return float(h)/2*sum(trap)
N = 1000
L = 3;
h = 0.01
x = np.arange(0,6,h);
t = np.linspace(0,L,300);
t = 1j*t;
dt = t[1] - t[0]
dx = x[1] - x[0]
A = 1j*dt/(2*dx**2)
pot = V(x)
Q = np.zeros([len(x),len(x)],dtype = complex)
P = np.zeros([len(x),len(x)],dtype = complex)
wave = np.zeros([len(x),len(t)],dtype = complex)
wave[:,0] = Psi(x)
B = (1- 2*A - 1j*dt*pot)
for ii in range(0,len(x)-1):
    Q[ii][ii] = -(B[ii])
    P[ii][ii] = (B[ii])
    Q[ii][ii+1] = (2-A)
    P[ii][ii+1] = A
    if ii >= 1:
        Q[ii][ii-1] = -A
        P[ii][ii-1] = A
plt.plot(wave[:,0])
for ii in range(0,len(t)-1):
    one = np.matmul(P,wave[:,ii])
    wave[:,ii+1] = np.matmul(la.inv(Q),one)
I can't find any mathematical errors in my implementation of the Crank-Nicolson method; however, whenever I run it I get an error saying that Q is singular (has no inverse). I'm not sure why this is occurring. Any help is appreciated. Thanks

You never assign to Q[-1]: the loop stops at ii = len(x)-2, so the last row of Q stays all zeros, and a matrix with a zero row is singular.
Also, don't repeatedly invert the matrix. Probably don't invert it at all, but rather store some decomposition of it to allow efficient calculation of Q^-1 x.
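
One way to store a decomposition is an LU factorization computed once before the time loop. The following is a minimal sketch using scipy.linalg.lu_factor and lu_solve; the grid, the potential, the real time step and the textbook Crank-Nicolson form Q = I + 0.5j*dt*H, P = I - 0.5j*dt*H are assumptions chosen for illustration and do not reproduce the matrices from the question.

```python
import numpy as np
import scipy.linalg as la

n, dx, dt = 200, 0.03, 0.01
x = np.arange(n) * dx
V = 0.25 * (x - 3) ** 2                      # hypothetical potential

# Hermitian tridiagonal H = -0.5 * d2/dx2 + V (second-order finite differences)
H = (np.diag(1.0 / dx**2 + V)
     + np.diag(np.full(n - 1, -0.5 / dx**2), 1)
     + np.diag(np.full(n - 1, -0.5 / dx**2), -1))
I = np.eye(n)
Q = I + 0.5j * dt * H                        # implicit side; every row is filled
P = I - 0.5j * dt * H                        # explicit side

lu_piv = la.lu_factor(Q)                     # factor once, outside the time loop
psi0 = np.exp(-2 * (x - 3) ** 2).astype(complex)
psi = psi0.copy()
for _ in range(50):
    psi = la.lu_solve(lu_piv, P @ psi)       # solves Q @ psi_new = P @ psi

print(np.linalg.norm(psi0), np.linalg.norm(psi))
```

Because Q is tridiagonal, a banded solver such as scipy.linalg.solve_banded would be cheaper still; the dense factorization is used here only to keep the sketch short.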

passing a function as an argument to a class

I have a function given by:
import numpy
import scipy.special
def p(z):
    z0=1./3.;eta=1.0
    value=eta*(z**2)*numpy.exp(-1*(z/z0)**eta)/scipy.special.gamma(3./eta)/z0**3
    return value
I want to pass this function to the following class which is in the file called redshift_probability.py as an argument p:
import pylab
import numpy
class GeneralRandom:
    """This class enables us to generate random numbers with an arbitrary
    distribution."""
    def __init__(self, x = pylab.arange(-1.0, 1.0, .01), p = None, Nrl = 1000):
        """Initialize the lookup table (with default values if necessary)
        Inputs:
        x = random number values
        p = probability density profile at that point
        Nrl = number of reverse look up values between 0 and 1"""
        if p == None:
            p = pylab.exp(-10*x**2.0)
        self.set_pdf(x, p, Nrl)
    def set_pdf(self, x, p, Nrl = 1000):
        """Generate the lookup tables.
        x is the value of the random variate
        pdf is its probability density
        cdf is the cumulative pdf
        inversecdf is the inverse look up table
        """
        self.x = x
        self.pdf = p/p.sum() #normalize it
        self.cdf = self.pdf.cumsum()
        self.inversecdfbins = Nrl
        self.Nrl = Nrl
        y = pylab.arange(Nrl)/float(Nrl)
        delta = 1.0/Nrl
        self.inversecdf = pylab.zeros(Nrl)
        self.inversecdf[0] = self.x[0]
        cdf_idx = 0
        for n in xrange(1,self.inversecdfbins):
            while self.cdf[cdf_idx] < y[n] and cdf_idx < Nrl:
                cdf_idx += 1
            self.inversecdf[n] = self.x[cdf_idx-1] + (self.x[cdf_idx] - self.x[cdf_idx-1]) * (y[n] - self.cdf[cdf_idx-1])/(self.cdf[cdf_idx] - self.cdf[cdf_idx-1])
            if cdf_idx >= Nrl:
                break
        self.delta_inversecdf = pylab.concatenate((pylab.diff(self.inversecdf), [0]))
    def random(self, N = 1000):
        """Give us N random numbers with the requested distribution"""
        idx_f = numpy.random.uniform(size = N, high = self.Nrl-1)
        idx = pylab.array([idx_f],'i')
        y = self.inversecdf[idx] + (idx_f - idx)*self.delta_inversecdf[idx]
        return y
I don't know how to pass input argument x as an input parameter to function p(z) when I call the class
from redshift_probability import GeneralRandom
z_pdf=GeneralRandom()
If I do the following, I get an error:
z_pdf.set_pdf( x=numpy.arange(0, 1.5, .001),p(x),N=1000000)
How do I modify it?

I think you want to change GeneralRandom.__init__ to look like this:
def __init__(self, x = pylab.arange(-1.0, 1.0, .01), p_func=None, Nrl = 1000):
    """Initialize the lookup table (with default values if necessary)
    Inputs:
    x = random number values
    p_func = function to compute probability density profile at that point
    Nrl = number of reverse look up values between 0 and 1"""
    if p_func is None:
        self.p_val = pylab.exp(-10*x**2.0)
    else:
        self.p_val = p_func(x)
    self.set_pdf(x, self.p_val, Nrl)
Then call it like this:
GeneralRandom(p_func=p)
That way, if you provide p_func it will be called with x as an argument, but if it's not provided, it gets set the same default as before. There's no need to call set_pdf explicitly, because it's called at the end of __init__.
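
The pattern itself can be shown without pylab. The following is a minimal sketch, assuming a stripped-down stand-in class (the name TabulatedPdf is hypothetical and keeps only the tabulation step of GeneralRandom), and using math.gamma in place of scipy.special.gamma since the argument is a scalar.

```python
import math
import numpy as np

def p(z, z0=1.0 / 3.0, eta=1.0):
    """Density profile from the question, evaluated on an array z."""
    return eta * z**2 * np.exp(-(z / z0) ** eta) / math.gamma(3.0 / eta) / z0**3

class TabulatedPdf:
    """Hypothetical stand-in for GeneralRandom: tabulates a pdf on a grid."""
    def __init__(self, x=None, p_func=None):
        if x is None:
            x = np.arange(-1.0, 1.0, 0.01)
        # a function is an ordinary argument; it is only called here, on the grid
        p_val = np.exp(-10 * x**2) if p_func is None else p_func(x)
        self.x = x
        self.pdf = p_val / p_val.sum()       # normalize
        self.cdf = self.pdf.cumsum()

# pass the function object itself (p), not the result of calling it (p(x))
z_pdf = TabulatedPdf(x=np.arange(0, 1.5, 0.001), p_func=p)
default_pdf = TabulatedPdf()
print(z_pdf.x[np.argmax(z_pdf.pdf)])
```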
