Efficient way to find index of interval - python

I'm writing a spline class in Python. The method that calculates the spline-interpolated value needs the index of the closest x data points. Currently, a simplified version looks like this:
def evaluate(x):
    for ii in range(N):  # N = len(x_data)
        if x_data[ii] <= x <= x_data[ii+1]:
            return calc(x, ii)
So it iterates through the list of x_data points until it finds the lower index ii of the interval in which x lies, and uses that index in calc, which performs the spline interpolation. While functional, this seems inefficient for large x_data arrays when x is close to the end of the data set. Is there a more efficient or elegant way to get the same result that does not require every interval to be checked iteratively?
Note: x_data may be assumed to be sorted so x_data[ii] < x_data[ii+1], but is not necessarily equally spaced.

This is exactly what bisect is for: https://docs.python.org/2/library/bisect.html
from bisect import bisect
index = bisect(x_data, x)
# I don't think you actually need the values of the two closest points, but if you do, here they are
point_less = x_data[index-1]  # note this will break if index is 0, so you probably want a special case for that
point_more = x_data[index]
closest_value = min([point_less, point_more], key=lambda y: abs(x-y))
Alternatively, you could write a binary search yourself (in fact, I'm pretty sure that's what bisect uses under the hood). It should be worst case O(log n), assuming your input array is already sorted.
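If all you need is the lower index of the bracketing interval for calc, a minimal sketch of wiring bisect into the original evaluate (reusing x_data, N and calc from the question) could look like this:

from bisect import bisect

def evaluate(x):
    ii = bisect(x_data, x) - 1      # lower index of the interval containing x
    ii = max(0, min(ii, N - 2))     # clamp so values at or beyond the ends still map to an end interval
    return calc(x, ii)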


Integer optimization/maximization in numpy

I need to estimate the size of a population, by finding the value of n which maximises scipy.misc.comb(n, a)/n**b where a and b are constants. n, a and b are all integers.
Obviously, I could just loop over range(SOME_HUGE_NUMBER), calculate the value for each n, and break out of the loop once I pass the peak of the curve. But I wondered whether there is an elegant way of doing this with (say) numpy/scipy, or in pure Python (e.g. an integer equivalent of Newton's method)?
As long as your number n is reasonably small (smaller than approx. 1500), my guess for the fastest way to do this is to actually try all possible values. You can do this quickly by using numpy:
import numpy as np
import scipy.misc as misc
nMax = 1000
a = 77
b = 100
n = np.arange(1, nMax+1, dtype=np.float64)
val = misc.comb(n, a)/n**b
print("Maximized for n={:d}".format(int(n[val.argmax()]+0.5)))
# Maximized for n=181
This is not especially elegant, but it is rather fast for that range of n. The problem is that for n > 1484 the numerator already gets too large to be stored in a float, and the method then fails with overflows. This is not just numpy.ndarray failing to work with Python's arbitrary-precision integers. Even with exact integers, you would not be able to compute:
misc.comb(10000, 1000, exact=True)/10000**1001
as you want a float result from dividing two numbers larger than the maximum float Python can hold (max_exp = 1024 on my system; see sys.float_info). You could not use your approach in that case either. If you really want to do something like that, you will have to take more care numerically.
You essentially have a nicely smooth function of n that you want to maximise. n is required to be integral but we can consider the function instead to be a function of the reals. In this case, the maximising integral value of n must be close to (next to) the maximising real value.
We could convert comb to a real function by using the gamma function and use numerical optimisation techniques to find the maximum. Another approach is to replace the factorials with Stirling's approximation. This gives a moderately complicated but tractable algebraic expression. This expression is not hard to differentiate and set to zero to find the extrema.
I did this and obtained
n * (b + (n-a) * log((n-a)/n) ) = a * b - a/2
This is not straightforward to solve algebraically but easy enough numerically (e.g. using Newton's method, as you suggest).
I may have made a mistake in the algebra, but I typed the a = 77, b = 100 example into Wolfram Alpha and got 180.58 so the approach seems to work.
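For what it's worth, here is a minimal sketch (my own, not from the answer above) of the gamma-function route: work in log space with scipy.special.gammaln so nothing overflows, maximise over real n, then compare the neighbouring integers.

import numpy as np
from scipy.special import gammaln
from scipy.optimize import minimize_scalar

a, b = 77, 100

def log_objective(n):
    # log( comb(n, a) / n**b ), written with gammaln so large n never overflows
    return gammaln(n + 1) - gammaln(a + 1) - gammaln(n - a + 1) - b * np.log(n)

# The function is smooth and unimodal for n > a, so a bounded scalar optimiser works.
res = minimize_scalar(lambda n: -log_objective(n), bounds=(a + 1, 1e6), method='bounded')
best_n = max((int(np.floor(res.x)), int(np.ceil(res.x))), key=log_objective)
print(best_n)  # 181 for a=77, b=100, matching the brute-force result above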

Does this shuffling algorithm produce each permutation with uniform probability?

I've seen how a particular naive shuffling algorithm is biased, and I feel like I basically get that, and I get why the Fisher-Yates algorithm is not biased. I have the following algorithm, which was the one I first thought of when I considered how to shuffle a list. I know it uses twice the memory and takes unnecessarily long to run, but I'm still curious whether it produces each permutation with a uniform distribution, or whether there's some sneaky reason I'm not seeing for it to be biased.
I'm also kind of wondering if there is some other "undesirable" property to a random shuffle that this would have, like perhaps the probabilities of various positions in the list being filled with some values are dependent.
import random as rand  # assuming rand refers to the standard random module

def shuf(x):
    out = [None for i in range(len(x))]
    for i in x:
        pos = rand.randint(0, len(x)-1)
        while out[pos] != None:
            pos = rand.randint(0, len(x)-1)
        out[pos] = i
    return out
I generated a heat map of this for a list of 20 elements over 10^6 trials. The (i, j) coordinate of the map represents the probability of the ith position of the list being filled with the jth element of the original list.
While I don't see any pattern in the heat map, it looks like the variance might be high. Or the heat map might just be overstating the variance because, hey, the minimum and maximum have to come up somewhere.
Undesirable property - this can be expensive if you're shuffling a large set:
while out[pos] != None:
    pos = rand.randint(0,len(x)-1)
Imagine len(x) == 100,000,000 and you've placed 90,000,000 already - you're going to loop a LOT before you get a hit.
Interesting exercises:
What does the heat map look like for simply generating random numbers between 1 and len(x) over 10^6 iterations?
What does the heat map look like for Fisher-Yates, for comparison?
At a glance, it looks to me like, given a uniform RNG, it should yield a truly uniform distribution over permutations (albeit more slowly than Fisher-Yates).
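For the comparison exercise, a minimal Fisher-Yates sketch (illustrative names, not from the original post) could be:

import random

def fisher_yates(x):
    out = list(x)
    for i in range(len(out) - 1, 0, -1):
        j = random.randint(0, i)          # pick from the not-yet-fixed prefix, inclusive
        out[i], out[j] = out[j], out[i]   # swap the chosen element into position i
    return out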

Python: Randomly draw several objects in a list

I am looking for the most efficient way to randomly draw n elements from a list, given a list of probabilities stating the probability of each element being picked.
aList = [3,4,2,1,4,3,5,7,6,4]
MyProba = [0.1,0.1,0.2,0,0.1,0,0.2,0,0.2,0.1]
It means that at each draw, the first element (which is 3) has a probability of 0.1 of being drawn. Of course,
sum(MyProba) == 1 # always returns True
len(aList) == len(MyProba) # always returns True
Up to now I did the following:
def random_pick(some_list, proba):
    x = random.uniform(0, 1)
    cumulative_proba = 0.0
    for item, item_proba in zip(some_list, proba):
        cumulative_proba += item_proba
        if x < cumulative_proba:
            break
    return item

nb_draws = 10
list_of_drawn_elements = []
for one_draw in range(nb_draws):
    list_of_drawn_elements.append(random_pick(aList, MyProba))
It works but it is terribly slow for long lists and big values of nb_draws. How can I improve the speed of this process?
Note: In the special case I am facing, nb_draws always equals the length of aList.
The general idea (as outlined by others' answers as well) is that your method is inefficient because the preprocessing (the calculation of the cumulative distribution) is done every time you draw a sample, although it would be enough to do it once before the sampling and then use the preprocessed data to do the sampling.
The preprocessing and sampling can be done efficiently with Walker's alias method. I implemented it a while ago; take a look at the source code. (Sorry for the external link, but I think it's too long to post here.) My version requires NumPy; if you don't want to use NumPy, there is a NumPy-free alternative as well (on which my version is based).
Edit: the explanation of Walker's alias method is to be found in the first link I provided. In a nutshell, imagine that you somehow managed to construct a rectangular "darts board" that is subdivided into parts such that each part corresponds to one of your original items, and the area of each part is proportional to the desired probability of selecting the corresponding element. You can then start throwing darts at random at the darts board (by generating two random numbers that specify the horizontal and vertical coordinate of where the dart ended up) and check which areas the darts hit. The items corresponding to the areas will be the items you have selected. Walker's alias method is simply a linear-time preprocessing that constructs the dart board. Drawing each element can then be done in constant time. In the end, drawing m elements out of n will have a cost of O(n) for preprocessing and O(m) for generating the samples, yielding a total complexity of O(n + m).
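For the curious, a compact sketch of the table construction (this is Vose's variant of the alias method, written from scratch here rather than taken from the linked implementation):

import random

def build_alias(proba):
    n = len(proba)
    prob = [p * n for p in proba]                      # scale so the average column height is 1
    alias = [0] * n
    small = [i for i, p in enumerate(prob) if p < 1.0]
    large = [i for i, p in enumerate(prob) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        alias[s] = l                                   # top up the short column s from the tall column l
        prob[l] -= 1.0 - prob[s]
        (small if prob[l] < 1.0 else large).append(l)
    return prob, alias

def draw(some_list, prob, alias):
    i = random.randrange(len(prob))                    # pick a column uniformly...
    return some_list[i] if random.random() < prob[i] else some_list[alias[i]]  # ...then a biased coin flip

Each draw costs two random numbers and one comparison, which is where the O(m) sampling term comes from.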
Here's my lazy method: build a list containing each value in proportion to its desired probability, and use random.choice() to pick from that list.
>>> import random
>>>
>>> aList = [3,4,2,1,4,3,5,7,6,4]
>>> MyProba = [0.1,0.1,0.2,0,0.1,0,0.2,0,0.2,0.1]
>>> expected_dist = sum([[v] * int(p * 100) for v, p in zip(aList, MyProba)], [])
>>> random.choice(expected_dist)
You might try precalculating the cumulative probability range for each element and building a tree from these intervals. Then you get logarithmic complexity for looking up the element corresponding to the generated probability, instead of the linear complexity you have now.
You're recalculating cumulative_proba every time you call random_pick. I suggest calculating it once, outside the method, and storing it in a structure that supports binary search, which reduces the per-draw time complexity from O(n) to O(log n).
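A minimal sketch of that cumulative-distribution idea (my own names, not from the answers above): build the running sums once, then each draw is a single binary search with bisect.

import random
from bisect import bisect

def make_sampler(some_list, proba):
    cumulative, total = [], 0.0
    for p in proba:                     # O(n) preprocessing, done once
        total += p
        cumulative.append(total)
    def draw():
        idx = bisect(cumulative, random.random() * total)   # O(log n) per draw
        return some_list[min(idx, len(some_list) - 1)]      # clamp guards against float edge cases
    return draw

draw = make_sampler(aList, MyProba)
samples = [draw() for _ in range(len(aList))]

(If NumPy is acceptable, numpy.random.choice(aList, size=len(aList), p=MyProba) does the preprocessing and sampling for you.)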

Interpolation of sin(x) using Python

I am working on a homework problem for which I am supposed to make a function that interpolates sin(x) for n+1 interpolation points and compares the interpolation to the actual values of sin at those points. The problem statement asks for a function Lagrangian(x,points) that accomplishes this, although my current attempt does not use 'x' and 'points' in the loops, so I think I will have to try again (especially since my code doesn't work as is!). However, why can't I access the items in the x_n array with an index, like x_n[k]? Additionally, is there a way to access only the 'x' values in the points array and loop over those for L_x? Finally, I think my 'error' definition is wrong, since it should also be an array of values. Is it necessary to make another for loop to compare each value in the 'error' array to 'max_error'? This is my code right now (we are executing in a GUI our professor made, so I think some of the commands are unique to that, such as messages.write()):
def problem_6_run(problem_6_n, problem_6_m, plot, messages, **kwargs):
    n = problem_6_n.value
    m = problem_6_m.value
    messages.write('\n=== PROBLEM 6 ==========================\n')
    x_n = np.linspace(0, 2*math.pi, n+1)
    y_n = np.sin(x_n)
    points = np.column_stack((x_n, y_n))
    i = 0
    k = 1
    L_x = 1.0
    def Lagrange(x, points):
        for i in n+1:
            for k in n+1:
                return L_x = (x - x_n[k] / x_n[i] - x_n[k])
            return Lagrange = y_n[i] * L_x
    error = np.sin(x) - Lagrange
    max_error = 0
    if error > max_error:
        max_error = error
    print.messages('Maximum error = &g' % max_error)
    plot.draw_lines(n+1, np.sin(x))
    plot.draw_points(m, Lagrange)
    plots.draw_points(m, error)
Edited:
Yes, the different things ThiefMaster mentioned are part of my (non CS) professor's environment; and yes, voithos, I'm using numpy and at this point have definitely had more practice with Matlab than Python (I guess that's obvious!). n and m are values entered by the user in the GUI; n+1 is the number of interpolation points and m is the number of points you plot against later.
Pseudocode:
Given n and m
Generate x_n, a list of n+1 evenly spaced points from 0 to 2*pi
Generate y_n, the corresponding list of points sin(x_n)
Define points, a 2D array consisting of these ordered pairs
Define Lagrange, a function of x and points
    for each value in the range n+1 (this is where I would like to use points but don't know how to access those values appropriately)
        evaluate y_n * (x - x_n[later index] / x_n[earlier index] - x_n[later index])
Calculate the max error
Calculate the error of the interpolation, Lagrange - sin(x)
plot sin(x); plot Lagrange; plot error
Does that make sense?
Some suggestions:
You can access items in x_n via x_n[k] (to answer your question).
Your loops for i in n+1: and for k in n+1: won't work at all: an integer is not iterable, so Python will raise a TypeError. You need for i in range(n+1) (or xrange) to loop over the values [0, 1, 2, ..., n].
in error = np.sin(x) - Lagrange: You haven't defined x anywhere, so this will probably result in an error. Did you mean for this to be within the Lagrange function? Also, you're subtracting a function (Lagrange) from a number np.sin(x), which isn't going to end well.
When you use the return statement in your def Lagrange, you exit the function, so the loops never run more than one iteration. You probably want to accumulate those values in a variable instead of returning them immediately.
Can you write some pseudocode to show what you'd like to do? e.g.:
Given a set of points `xs` and "interpolated" points `ys`:
    For each point (x,y) in (xs,ys):
        Calculate `sin(x)`
        Calculate `sin(x)-y`, the difference between the function and y
        .... etc etc
This will make the actual code easier for you to write, and easier for us to help you with (especially if you intellectually understand what you're trying to do, and the only problem is with converting that into python).
So: try to fix up some of these points in your code, and try writing some pseudocode that says what you want to do, and we'll keep helping you :)
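To make the structure concrete, here is a minimal sketch of evaluating a Lagrange interpolant of sin at one point (plain Python, none of the GUI plumbing; since this is homework, treat it as a reference for the loop structure rather than a drop-in solution):

import numpy as np

def lagrange_eval(x, x_n, y_n):
    total = 0.0
    for i in range(len(x_n)):
        L_i = 1.0
        for k in range(len(x_n)):
            if k != i:
                L_i *= (x - x_n[k]) / (x_n[i] - x_n[k])   # basis polynomial L_i(x)
        total += y_n[i] * L_i
    return total

x_n = np.linspace(0, 2 * np.pi, 6)   # n+1 = 6 interpolation nodes
y_n = np.sin(x_n)
xs = np.linspace(0, 2 * np.pi, 50)   # m = 50 evaluation points
max_error = max(abs(np.sin(xx) - lagrange_eval(xx, x_n, y_n)) for xx in xs)
print('Maximum error = %g' % max_error)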

Methods for quickly calculating standard deviation of large number set in Numpy

What's the best (fastest) way to do this?
This generates what I believe is the correct answer, but obviously at N = 10e6 it is painfully slow. I think I need to keep the Xi values so I can correctly calculate the standard deviation, but are there any techniques to make this run faster?
def randomInterval(a, b):
    r = ((b-a)*float(random.random(1)) + a)
    return r

N = 10e6
Sum = 0
x = []
for sample in range(0, int(N)):
    n = randomInterval(-5., 5.)
    while n == 5.0:
        n = randomInterval(-5., 5.)  # since X is [-5,5)
    Sum += n
    x = np.append(x, n)
A = Sum/N

summation = 0
for sample in range(0, int(N)):
    summation += (x[sample] - A)**2.0
standard_deviation = np.sqrt((1./N)*summation)
You made a decent attempt, but you should make sure you understand this rather than copying it verbatim, since this is homework:
import numpy as np
N = int(1e6)
a = np.random.uniform(-5,5,size=(N,))
standard_deviation = np.std(a)
This assumes you can use a package like numpy (you tagged the question as such). If you can, there is a whole host of methods for creating and operating on arrays of data, which avoids explicit looping (it's done under the hood in an efficient manner). It would be good to take a look at the documentation to see what features are available and how to use them:
http://docs.scipy.org/doc/numpy/reference/index.html
Using the formulas found on the Wikipedia page for variance, you could compute it in one loop without storing a list of the random numbers (assuming you don't need them elsewhere).
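As a sketch of that one-loop idea, using Welford's online update (my choice of formula) so that neither the samples nor large intermediate sums need to be stored:

import math
import random

N = 10**6
mean, M2 = 0.0, 0.0
for i in range(1, N + 1):
    sample = random.uniform(-5.0, 5.0)
    delta = sample - mean
    mean += delta / i                 # running mean
    M2 += delta * (sample - mean)     # running sum of squared deviations
standard_deviation = math.sqrt(M2 / N)   # population standard deviation, about 10/sqrt(12) ~ 2.89 for U(-5, 5)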
