ChiSquare calculation returning all zeros - python

EDIT: After more trial and error, I figured out that, for some reason, Python says that 1/52 is 0. Can anyone explain why, so I can avoid this problem in the future?
I've been struggling with a script for a while now, mainly because neither I nor my fellow students can find out what's wrong with it.
To keep things simple: we've got data and a model, and we have to rescale some of the data points to the model and then do a chi-square minimization to find the best rescaling factor.
I've tried multiple things already: putting everything in one loop and, when that didn't work, splitting the loops up, etc.
The relevant part of my code looks like this:
#Here I pick the values of the model that correspond to the data
y4 = np.zeros((len(l),1))
for x in range(0,len(l)):
    if l[x] < 2.16:
        for y in range(0,len(lmodel)):
            if lmodel[y] == l[x]:
                y4[x] = y2[y]
            elif lmodel[y] < l[x] < lmodel[y+1]:
                y4[x] = (y2[y] + y2[y+1])/2
    else:
        y4[x] = y1[x]
#Do Chi2 calculation
#First, I make a matrix with all the possible rescaled values
chi2 = np.zeros((200,1))
y3 = np.zeros((len(l),len(chi2)))
for z in range(0,len(chi2)):
    for x in range(0,len(l)):
        if l[x] < 2.16:
            y3[x,z] = y1[x]*10**(0.4*Al[x]*z/100)
        else:
            y3[x,z] = y1[x]
#Here I calculate the chisquare for each individual column and put it in the chi2 array
dummy = np.zeros((len(l),1))
for x in range(0,len(chi2)):
    for t in range(0, len(l)):
        dummy[t] = (1/52)*((y3[t,x] - y4[t])/fle[t])**2
    chi2[x] = np.sum(dummy)
The thing is that no matter what I try, for some reason, my dummy array is always all zeros, making every single chi square value 0.
I've tried making 'dummy' a matrix and summing afterwards, I've tried printing individual values for the calculation of the dummy[t]'s, and some of them were 0 (as expected), some weren't, so logically, if the individual values aren't all 0, neither should every value in dummy be.
I just can't find where I go wrong, and why I keep getting arrays of zeros.

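In Python 2, 1 / 52 divides two integers, so it is an integer (floor) division and returns 0. You can fix it by explicitly using floating point numbers, e.g. 1.0 / 52, or by putting from __future__ import division at the top of the script.
In Python 3, this is no longer true: / always performs true division and returns a float (1 / 52 is about 0.0192), while // is the floor division operator.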
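To make the difference concrete, here is a minimal sketch, assuming Python 3 and NumPy. The function reduced_chi2 and its parameter names are made up for illustration and are not from the question; the only point is that the 1/52 factor is written so that it cannot silently become 0.

```python
import numpy as np

print(1 / 52)    # true division in Python 3, a small non-zero float
print(1 // 52)   # floor division, 0 in both Python 2 and Python 3
print(1.0 / 52)  # a float in both Python 2 and Python 3

def reduced_chi2(model, data, sigma, dof=52):
    """Chi-square divided by the degrees of freedom (hypothetical helper).

    float(dof) keeps the division a float division even under Python 2.
    """
    return np.sum(((data - model) / sigma) ** 2) / float(dof)

# Tiny made-up example: residuals 1 and 2, unit errors, 5 degrees of freedom
model = np.array([1.0, 2.0])
data = np.array([2.0, 4.0])
sigma = np.array([1.0, 1.0])
print(reduced_chi2(model, data, sigma, dof=5))
```

Dividing by float(dof) at the end, rather than multiplying every term by (1/52), also avoids the temporary dummy array.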

Related

How can I solve for x with Ax=B, when A and X are 1-d arrays and I know A?

In my original code I have the following function:
B = np.inner(A,x)
where A.shape = [307_200] and has values -1 or 1
where x.shape = [307_200] and has values 0 to 255
where B results in an integer with a large value.
Assuming I know A and B, but don't know x, how can I solve for x?
To simplify the problem...
import numpy as np
A = np.random.choice(a=[-1,1], size=10)
x = np.random.choice(a=range(0,256), size=10)
B = np.inner(A, x)
I want to solve for x now. So something like one of the following...
x_solved = np.linalg.solve(A,x)
x_solved = np.linalg.lstsq(A,x)
Is it possible?
Extra info...
I could change A to be a n x m matrix, but since I am dealing with large matrices, when I try to use lstsq I quickly run out of memory. This is bad because 1. I can't run on my local machine and 2. the end use application needs to limit RAM.
However, for the problem above, I can accept RAM-intensive solutions, since I might be able to moderate the compute resources with some clever tricks.
Also, we could switch A to boolean values if that would help.
Apologies if the solution is obvious or simple.
Thanks for the help.
Here is your problem re-stated:
I have an array A containing many 1s and -1s. I want to make another array x containing integers 0-255 so that when I multiply each entry by the corresponding entry of the first array, then add up all the products, I get some target number B.
Notice that the problem is just as difficult if you shuffle the array elements. So let's shuffle them so all the 1s are at the start and all the -1s are at the end. After solving this simplified version of the problem, we can shuffle them back.
Now the simplified problem is this:
I have A_1 1s and A_-1 -1s. I want to make two arrays x_1 and x_-1 containing numbers from 0-255 so that when I add all the numbers in x_1 and subtract all the numbers in x_-1 I get some target number B.
Can you work out how to solve this?
I'd start by filling x_1 with the number 255 until the next 255 would make the sum too high, then fill the next entry with the number that makes the sum equal the target, then fill the rest with 0s. Then fill x_-1 with 0s. If the target number is negative, do the opposite. Then un-shuffle it: match up the x_1 and x_-1 arrays with the positions of the 1s and -1s in your array A. And you're done.
You can actually write that algorithm so it puts the numbers directly in x without needing to make the temporary arrays x_1 and x_-1.
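The greedy fill described above could be sketched as follows. This is only an illustration under stated assumptions: the function name construct_x and the max_val parameter are made up, and the result is one x that reproduces B, not necessarily the x that originally produced it (a single equation cannot pin down that many unknowns).

```python
import numpy as np

def construct_x(A, B, max_val=255):
    """Build one array x with entries in 0..max_val such that np.inner(A, x) == B.

    Hypothetical helper: writes directly into x, no temporary arrays needed.
    """
    A = np.asarray(A)
    x = np.zeros(A.shape, dtype=np.int64)
    # A positive target is built from the +1 positions, a negative one from the -1 positions.
    positions = np.flatnonzero(A == (1 if B >= 0 else -1))
    remaining = abs(int(B))
    if remaining > max_val * len(positions):
        raise ValueError("target B cannot be reached with this A")
    full, rest = divmod(remaining, max_val)
    x[positions[:full]] = max_val      # as many full entries as fit
    if rest:
        x[positions[full]] = rest      # one entry that makes the sum exact
    return x

A = np.array([1, -1, 1, 1, -1])
x_pos = construct_x(A, 600)
x_neg = construct_x(A, -300)
print(x_pos, np.inner(A, x_pos))
print(x_neg, np.inner(A, x_neg))
```

The memory use is one extra array of the same length as A, so it should stay cheap even for 307,200 entries.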

Why does np.add.at() return the wrong answer for large arrays?

I have a large data set, statistic, with statistic.shape = (1E10,) that I want to effectively bin (sum) into an array of zeros, out = np.zeros(1E10). Each entry in statistic has a corresponding index, idx, which tells me in which out bin it belongs. The indices are not unique, so I cannot use out[idx] += statistic, since this counts a repeated index only once. Therefore I'm using np.add.at(out, idx, statistic). My problem is that for very large arrays, np.add.at() returns the wrong answer.
Below is an example script that shows this behaviour. The function check_add() should return 1.
import numpy as np
def check_add(N):
    N = int(N)
    out = np.zeros(N)
    np.add.at(out, np.arange(N), np.ones(N))
    return np.sum(out)/N
n_arr = [1E3, 1E5, 1E8, 1E10]
for n in n_arr:
    print('N = {} (log(N) = {}); output ratio is {}'.format(n, np.log10(n), check_add(n)))
This example returns for me:
N = 1000.0 (log(N) = 3.0); output ratio is 1.0
N = 100000.0 (log(N) = 5.0); output ratio is 1.0
N = 100000000.0 (log(N) = 8.0); output ratio is 1.0
N = 10000000000.0 (log(N) = 10.0); output ratio is 0.1410065408
Can someone explain to me why the function fails for N=1E10?
This is an old bug, NumPy issue 13286. ufunc.at was using a too-small variable for the loop counter. It got fixed a while ago, so update your NumPy. (The fix is present in 1.16.3 and up.)
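If updating is not an option, one possible workaround for this kind of binning is np.bincount with the weights argument. That function is not mentioned in the question; it is swapped in here as an alternative, and the small arrays below are made up to show that both approaches agree when indices repeat.

```python
import numpy as np

n_bins = 5
idx = np.array([0, 1, 1, 3])                 # bin index of each entry (repeats allowed)
statistic = np.array([1.0, 2.0, 3.0, 4.0])   # values to be summed per bin

# Unbuffered in-place addition, as in the question
out_at = np.zeros(n_bins)
np.add.at(out_at, idx, statistic)

# Same result via bincount: sums the weights that fall into each index
out_bincount = np.bincount(idx, weights=statistic, minlength=n_bins)

print(out_at)
print(out_bincount)
```

np.bincount requires non-negative integer indices and returns a new array instead of adding into an existing one, so it only fits when out starts as zeros (or when the result is added to out afterwards).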
You're overflowing int32:
1E10 % (np.iinfo(np.int32).max - np.iinfo(np.int32).min + 1) # + 1 for 0
Out[]: 1410065408
There's your weird number (googling that number actually got me to here which is how I figured this out.)
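The arithmetic can be checked with plain Python integers, no large arrays needed. This is only a sketch of the wrap-around: a 32-bit counter can hold 2**32 distinct values, so N is effectively reduced modulo 2**32.

```python
N = 10**10
wrapped = N % 2**32   # what is left of N after wrapping a 32-bit counter

print(wrapped)        # 1410065408
print(wrapped / N)    # 0.1410065408, the ratio reported in the question
```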
Now, what's happening in your function is a bit stranger. Going by the documentation of ufunc.at, you should just be accumulate-adding the 1 values at the indices that are lower than np.iinfo(np.int32).max and at the negative indices above np.iinfo(np.int32).min, but it seems to be 1) working backwards and 2) stopping when it gets to the last overflow. Without digging into the C code I couldn't tell you why, but it's probably a good thing that it does: had it behaved the way the documentation suggests, your function would have failed silently, returning the "correct" mean while corrupting your results (with 2 or 3 in those indices and 0 in the middle).
It is most likely due to integer precision indeed. If you play around with the NumPy data type (e.g. constrain it to an unsigned value between 0 and 255 by setting uint8), you will see that the ratios already start declining for the second array. I do not have enough memory to test it, but setting all dtypes to uint64 as below should help:
def check_add(N):
    N = int(N)
    out = np.zeros(N,dtype='uint64')
    np.add.at(out, np.arange(N,dtype='uint64'), 1)
    return np.sum(out)/N
To understand the behavior, I recommend setting dtype='uint8' and checking the behavior for smaller N. What happens is that np.arange creates ascending integers for the vector elements until it reaches the integer limit. It then starts again at 0 and counts up again, so at first (for smaller N) you get the correct sum, although your out vector contains a lot of elements > 1 in the positions 0:limit and a lot of elements equal to 0 beyond the limit. If, however, you choose N large enough, the elements in your out vector start exceeding the integer limit and wrap around to 0. As soon as that happens, your sum is vastly off. To double-check, note that the uint8 limit is 255 (256 distinct integers) and 256^2 = 65536. Set N = 65536 with dtype='uint8' and check_add(65536) will return 0.
import numpy as np
def check_add(N):
    N = int(N)
    out = np.zeros(N,dtype='uint8')
    np.add.at(out, np.arange(N,dtype='uint8'), 1)
    return np.sum(out)/N
n_arr = [1E1, 1E3, 1E5,65536, 1E7]
for n in n_arr:
    print('N = {} (log(N) = {}); output ratio is {}'.format(n, np.log10(n), check_add(n)))
Also note that you don't need the np.ones vector; you can simply replace it with 1 if all you care about is uniformly incrementing everything by 1.
Guessing, as I couldn't run it, but could the problem be that you are exceeding the maximum integer value in Python for the last option, i.e. exceeding 2147483647?
Use the long integer type instead, as per below.
Referring to: https://docs.python.org/2.0/ref/integers.html
Hope this helps. Please let me know if it works.

Rating the success of a classifier by comparing return boolean (1/0) to given value in 2D array

I have an array "D" that contains dogs and their health conditions.
The classifier() method returns either 1 or 0 and takes one row of the 2D array as input.
I want to compare the classifier result to column 13 of the 2D array
In an ideal case the classifier would always return the same value as specified in that column.
Now I try to calculate the total hitrate of the classifier by adding up successes and dividing it by the total number of results.
So far I have worked out an enumerate for loop to hand over rows to the classifier in sequence.
def accuracy(D, classifier):
    for i, item in enumerate(D):
        if classifier(item)==D[i,13]
            #Compare result of classifier with actual value
            x+=1 #Increase x on a hit
    acc=(x/D.length)
    #Divide x by length of D to calculate hitrate eg. "0.5"; 100% would be "1"
    return acc
There is probably a simple formatting error somewhere or I have an error in my logic.
(Am 2 Days into Python now)
I think I might not be doing the if compare correctly.
Assuming both D and classifier are defined, there are a few errors in your code, most of which should give reasonable error messages (apart from the integer division, which can be tricky in Python 2).
You're missing a : at the end of the if statement, and x is never initialized before x += 1. Indexing with D[i, 13] only works if D is a NumPy array; a plain list of lists has to be indexed with another set of [], like D[i][13]. However, since you're already enumerating the 2D array, you may as well use item[13] to get the value.
Neither Python lists nor NumPy arrays have a .length attribute; use len(D) instead.
Lastly, if you want a decimal value at the end in Python 2, you'll also need to cast at least one of the values to a float, like float(x)/len(D), otherwise the integer division will truncate the result to 0 or 1.
Fixed code:
def accuracy(D, classifier):
    x = 0 # Number of hits so far
    for i, item in enumerate(D):
        if classifier(item) == D[i][13]:
        # if classifier(item) == item[13]: # This should also work, you can use either.
            x += 1 #Increase x on a hit
    acc = float(x)/len(D)
    # Divide x by length of D to calculate hitrate eg. "0.5"; 100% would be "1"
    return acc
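For a quick sanity check, the same idea can be run on made-up data. This is a sketch assuming Python 3 (where / is already true division); the rows, the label_col parameter, and the toy classifier below are all hypothetical and only serve to show the expected hit rate.

```python
def accuracy(D, classifier, label_col=13):
    """Fraction of rows for which classifier(row) equals the label column."""
    hits = sum(1 for row in D if classifier(row) == row[label_col])
    return hits / len(D)

# Hypothetical data: 4 rows with 14 columns each, true label in column 13
D = [
    [0.9] + [0] * 12 + [1],
    [0.2] + [0] * 12 + [0],
    [0.7] + [0] * 12 + [0],
    [0.1] + [0] * 12 + [0],
]

def classifier(row):
    """Toy classifier: predicts 1 when the first feature exceeds 0.5."""
    return 1 if row[0] > 0.5 else 0

print(accuracy(D, classifier))  # 0.75 (3 of 4 rows match)
```

Because it indexes with row[label_col], this version works both for a list of lists and for a 2D NumPy array.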

Using other data with a function of the form f(x,y) = f(x,y,z).sum(axis=-1)

So, in my previous question wflynny gave me a really neat solution (Surface where height is a function of two functions, and a sum over the third). I've got that part working for my simple version, but now I'm trying to improve on this.
Consider the following lambda function:
x = np.arange(0,100, 0.1)
y = np.sin(x)
f = lambda xx: (xx-y[x==xx])**2
values = f(x)
Now, in this scenario it works. In fact, the [x==xx] is trivial in the example. However, the example can be extended:
x = np.arange(0,100, 0.1)
z = np.sin(x)
f = lambda xx, yy: ( (xx-z[x==xx])**2 + yy**2)**0.5
y = np.arange(0,100,0.1)
xgrid, ygrid = np.meshgrid(x,y)
values = f(xgrid,ygrid)
In this case, the error ValueError: boolean index array should have 1 dimension is generated. This is because z.shape is different from xgrid.shape, I think.
Note that here, z = np.sin(x) is a simplification. It's not a function but an array of arbitrary values. We really need to go to that array to retrieve them.
I do not know what the proper way to implement this is. I am going to try some things, but I hope that somebody here will give me hints or provide me with the proper way to do this in Python.
EDIT: I originally thought I had solved it by using the following:
retrieve = lambda pp: map(lambda pp: dataArray[pp==phiArray][0], phi)
However, this merely returns the dataArray. Suppose dataArray contains a number of 'maximum' values for the polar radius. Then, you would normally incorporate this by saying something like g = lambda xx, yy: f(xx,yy) * Heaviside( dataArray - radius(xx,yy)). Then g would properly be zero if the radius is too large.
However, this doesn't work. I'm not fully sure but the behaviour seems to be something like taking a single value of dataArray instead of the entire array.
Thanks!
EDIT: Sadly, this stuff has to work and I can't spend more time on making it nice. Therefore, I've opted for the dirty implementation. The actual thing I was interested in would be of the sort as the g = lambda xx, yy written above, so I can implement that directly (dirty) instead of nicely (without nested for loops).
def envelope(xx, yy):
    value = xx * 0.
    for i in range(0,N): #N is defined somewhere, and xx.shape = (N,N)
        for j in range(0,N):
            if ( dataArray[x==xx[i,j]][0] > radius(xx[i,j],yy[i,j])):
                value[i,j] = 1.
            else:
                value[i,j] = 0.
    return value
A last resort, but it works. And, sometimes results matter over writing good code, especially when there's a deadline coming up (and you are the only one that cares about good code).
I would still be very much interested in learning how to do this properly, if there is a proper way, and thus increase my fluency in clean Python.
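One possible way to avoid the nested loops is to turn the lookup "the entry of z that belongs to this x value" into an index array with np.searchsorted and then use ordinary fancy indexing. This is only a sketch under assumptions: x is sorted, every grid value occurs exactly in x (true here because the grid is built from x), z stands in for the array of arbitrary values, and radius is a hypothetical placeholder for the function mentioned in the question.

```python
import numpy as np

x = np.arange(0, 100, 0.1)
y = np.arange(0, 100, 0.1)
z = np.sin(x)  # stand-in for the array of arbitrary values, one per entry of x

xgrid, ygrid = np.meshgrid(x, y)

# For every grid value, find its position in x (x must be sorted ascending).
idx = np.searchsorted(x, xgrid)
zgrid = z[idx]  # z looked up per grid point, same shape as xgrid

values = np.sqrt((xgrid - zgrid) ** 2 + ygrid ** 2)

def radius(xx, yy):
    """Hypothetical polar radius; the question does not show its definition."""
    return np.hypot(xx, yy)

# Vectorized counterpart of envelope(): 1.0 where the looked-up value exceeds the radius
envelope = (zgrid > radius(xgrid, ygrid)).astype(float)
```

If the grid values are not guaranteed to match entries of x exactly, an interpolating lookup such as np.interp(xgrid, x, z) may be the safer choice.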

Interpolation of sin(x) using Python

I am working on a homework problem for which I am supposed to write a function that interpolates sin(x) for n+1 interpolation points and compares the interpolation to the actual values of sin at those points. The problem statement asks for a function Lagrangian(x,points) that accomplishes this, although my current attempt does not use 'x' and 'points' in the loops, so I think I will have to try again (especially since my code doesn't work as is!).
My questions: Why can't I access the items in the x_n array with an index, like x_n[k]? Is there a way to access only the 'x' values in the points array and loop over those for L_x? Finally, I think my 'error' definition is wrong, since it should also be an array of values. Is it necessary to write another for loop to compare each value in the 'error' array to 'max_error'?
This is my code right now (we are executing it in a GUI our professor made, so I think some of the commands, such as messages.write(), are unique to that):
def problem_6_run(problem_6_n, problem_6_m, plot, messages, **kwargs):
    n = problem_6_n.value
    m = problem_6_m.value
    messages.write('\n=== PROBLEM 6 ==========================\n')
    x_n = np.linspace(0,2*math.pi,n+1)
    y_n = np.sin(x_n)
    points = np.column_stack((x_n,y_n))
    i = 0
    k = 1
    L_x = 1.0
    def Lagrange(x, points):
        for i in n+1:
            for k in n+1:
                return L_x = (x- x_n[k] / x_n[i] - x_n[k])
            return Lagrange = y_n[i] * L_x
    error = np.sin(x) - Lagrange
    max_error = 0
    if error > max_error:
        max_error = error
    print.messages('Maximum error = &g' % max_error)
    plot.draw_lines(n+1,np.sin(x))
    plot.draw_points(m,Lagrange)
    plots.draw_points(m,error)
Edited:
Yes, the different things ThiefMaster mentioned are part of my (non CS) professor's environment; and yes, voithos, I'm using numpy and at this point have definitely had more practice with Matlab than Python (I guess that's obvious!). n and m are values entered by the user in the GUI; n+1 is the number of interpolation points and m is the number of points you plot against later.
Pseudocode:
Given n and m
Generate x_n a list of n evenly spaced points from 0 to 2*pi
Generate y_n a corresponding list of points for sin(x_n)
Define points, a 2D array consisting of these ordered pairs
Define Lagrange, a function of x and points
for each value in the range n+1 (this is where I would like to use points but don't know how to access those values appropriately)
evaluate y_n * (x - x_n[later index] / x_n[earlier index] - x_n[later index])
Calculate max error
Calculate error interpolation Lagrange - sin(x)
plot sin(x); plot Lagrange; plot error
Does that make sense?
Some suggestions:
You can access items in x_n via x_n[k] (to answer your question).
Your loops for i in n+1: and for k in n+1: will not run as written: n+1 is an integer, and iterating over an integer raises TypeError: 'int' object is not iterable. You need to use for i in range(n+1) (or xrange in Python 2) to get the whole list of values [0,1,2,...,n].
In error = np.sin(x) - Lagrange: you haven't defined x anywhere, so this will result in an error. Did you mean for this to be within the Lagrange function? Also, you're subtracting a function (Lagrange) from a number (np.sin(x)), which isn't going to end well.
When you use the return statement in your def Lagrange you are exiting your function. So your loop will never loop more than once because you're returning out of the function. I think you might actually want to store those values instead of returning them.
Can you write some pseudocode to show what you'd like to do? e.g.:
Given a set of points `xs` and "interpolated" points `ys`:
For each point (x,y) in (xs,ys):
Calculate `sin(x)`
Calculate `sin(x)-y` being the difference between the function and y
.... etc etc
This will make the actual code easier for you to write, and easier for us to help you with (especially if you intellectually understand what you're trying to do, and the only problem is with converting that into Python).
So: try to fix up some of these points in your code, and try to write some pseudocode to say what you want to do, and we'll keep helping you :)
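For reference, once the points above are fixed, a Lagrange interpolation could look roughly like the sketch below. It is not the asker's or the professor's code: the GUI objects (plot, messages) are left out, n and m are set by hand, and the formula used is the standard one, L(x) = sum over i of y_i * product over k != i of (x - x_k) / (x_i - x_k). Note the parentheses around both differences, which the original expression was missing.

```python
import numpy as np

def lagrange(x, points):
    """Evaluate the Lagrange interpolating polynomial through `points` at x.

    points is a 2D array with the x values in column 0 and the y values in column 1.
    """
    xs, ys = points[:, 0], points[:, 1]
    x = np.asarray(x, dtype=float)
    total = np.zeros_like(x)
    for i in range(len(xs)):
        basis = np.ones_like(x)          # running product for the i-th basis polynomial
        for k in range(len(xs)):
            if k != i:
                basis *= (x - xs[k]) / (xs[i] - xs[k])
        total += ys[i] * basis           # accumulate instead of returning inside the loop
    return total

n, m = 8, 50                             # made-up values; the GUI would supply these
x_n = np.linspace(0, 2 * np.pi, n + 1)
points = np.column_stack((x_n, np.sin(x_n)))

x_plot = np.linspace(0, 2 * np.pi, m)
error = np.abs(np.sin(x_plot) - lagrange(x_plot, points))
max_error = error.max()                  # no extra loop needed for the maximum
print('Maximum error = %g' % max_error)
```

Since error is an array, error.max() (or np.max(error)) replaces the manual comparison against max_error.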
