I have a 1D array of data and wish to extract the spatial variation. The standard way to do this which I wish to pythonize is to perform a moving linear regression to the data and save the gradient...
def nssl_kdp(phidp, distance, fitlen):
kdp=zeros(phidp.shape, dtype=float)
myshape=kdp.shape
for swn in range(myshape[0]):
print "Sweep ", swn+1
for rayn in range(myshape[1]):
print "ray ", rayn+1
small=[polyfit(distance[a:a+2*fitlen], phidp[swn, rayn, a:a+2*fitlen],1)[0] for a in xrange(myshape[2]-2*fitlen)]
kdp[swn, rayn, :]=array((list(itertools.chain(*[fitlen*[small[0]], small, fitlen*[small[-1]]]))))
return kdp
This works well but is SLOW... I need to do this 17*360 times...
I imagine the overhead is in the iterator in the [ for in arange] line... Is there an implimentation of a moving fit in numpy/scipy?
the calculation for linear regression is based on the sum of various values. so you could write a more efficient routine that modifies the sum as the window moves (adding one point and subtracting an earlier one).
this will be much more efficient than repeating the process every time the window shifts, but is open to rounding errors. so you would need to restart occasionally.
you can probably do better than this for equally spaced points by pre-calculating all the x dependencies, but i don't understand your example in detail so am unsure whether it's relevant.
so i guess i'll just assume that it is.
the slope is (NΣXY - (ΣX)(ΣY)) / (NΣX2 - (ΣX)2) where the "2" is "squared" - http://easycalculation.com/statistics/learn-regression.php
for evenly spaced data the denominator is fixed (since you can shift the x axis to the start of the window without changing the gradient). the (ΣX) in the numerator is also fixed (for the same reason). so you only need to be concerned with ΣXY and ΣY. the latter is trivial - just add and subtract a value. the former decreases by ΣY (each X weighting decreases by 1) and increases by (N-1)Y (assuming x_0 is 0 and x_N is N-1) each step.
i suspect that's not clear. what i am saying is that the formula for the slope does not need to be completely recalculated each step. particularly because, at each step, you can rename the X values as 0,1,...N-1 without changing the slope. so almost everything in the formula is the same. all that changes are two terms, which depend on Y as Y_0 "drops out" of the window and Y_N "moves in".
I've used these moving window functions from the somewhat old scikits.timeseries module with some success. They are implemented in C, but I haven't managed to use them in a situation where the moving window varies in size (not sure if you need that functionality).
http://pytseries.sourceforge.net/lib.moving_funcs.html
Head here for downloads (if using Python 2.7+, you'll probably need to compile the extension itself -- I did this for 2.7 and it works fine):
http://sourceforge.net/projects/pytseries/files/scikits.timeseries/0.91.3/
I/we might be able to help you more if you clean up your example code a bit. I'd consider defining some of the arguments/objects in lines 7 and 8 (where you're defining 'small') as variables, so that you don't end row 8 with so many hard-to-follow parentheses.
Ok.. I have what seems to be a solution.. not an answer persay, but a way of doing a moving, multi-point differential... I have tested this and the result looks very very similar to a moving regression... I used a 1D sobel filter (ramp from -1 to 1 convolved with the data):
def KDP(phidp, dx, fitlen):
kdp=np.zeros(phidp.shape, dtype=float)
myshape=kdp.shape
for swn in range(myshape[0]):
#print "Sweep ", swn+1
for rayn in range(myshape[1]):
#print "ray ", rayn+1
kdp[swn, rayn, :]=sobel(phidp[swn, rayn,:], window_len=fitlen)/dx
return kdp
def sobel(x,window_len=11):
"""Sobel differential filter for calculating KDP
output:
differential signal (Unscaled for gate spacing
example:
"""
s=np.r_[x[window_len-1:0:-1],x,x[-1:-window_len:-1]]
#print(len(s))
w=2.0*np.arange(window_len)/(window_len-1.0) -1.0
#print w
w=w/(abs(w).sum())
y=np.convolve(w,s,mode='valid')
return -1.0*y[window_len/2:len(x)+window_len/2]/(window_len/3.0)
this runs QUICK!
Related
I am a beginner to Python. Currently I'm writing a code for developing a simple solver for non-linear ODE systems with initial value. The equations of the system are as follow.
The function of myu is evaluated first to get the value of myu, then used in dX/dt, dS/dt, and dDO/dt. At the next step, myu is evaluated again to get its new value based on new value of S and DO.
I am using General Linear Method (GLM), proposed by J. C. Butcher, as my method. This method use a transition matrix, which value and size depends on numerical method that we use. In this case, I use Runge Kutta Cash-Karp.
While you may find in the equation that D is also a function, here I set the value of D as a constant.
In initialization, the value of h is set first, to get the number of step. I create a vector named 'initValue', with 8 columns and 4 rows, consist of values of k for each equations (row 1 to 6), initial value for fourth order of the Runge Kutta (row 7. I set it to 0 since it just act as a 'predictor'), and initial value for each equations (row 8).
Transition matrix is created based on the GLM, which values inside it comes from the constants of stage equations (to find the value of k1 to k6) and step equations (to find the solutions) of Runge Kutta Cash-Karp.
In the looping, at the very first time, I simply add the initial values to an array named 'result'. At the first step, I simply multiple the transition matrix with vector 'initValue'. And at the next until final step, I initialize the 'initValue' based on result from previous step.
What I'm looking for is the solution which should look like this.
My code works if h is less than 1. I compare my result with result from scipy.integrate.odeint. But when I set h bigger than 1, it show different result than the result it should be. For example, in the code, I set h = 100, which means that it will only display the initial value and final value (when time = 100). While X and S should going upward, and DO and Xr going downward, mine was the opposite of them. The result from odeint when h is set to bigger than 1 show the same result with the expected solution.
I need help to fix my code so it can display the expected solution for any value of h.
Thank you.
Why do you expect any type of reasonable result for ridiculously large step sizes?
The most simple demonstration is y'=-y and the explicit Euler method. If you use step sizes smaller 1 you will get qualititively correct solutions. For step sizes smaller 0.1, you will start to get also quantitatively correct step sizes.
However, if you use a step size h=10, then the iteration
y[k+1]= y[k] - h*y[k] = -9*y[k]
will explode. The same also happens for higher order methods, sufficiently small step sizes give quantitatively correct results, medium step sizes can still give a qualitatively correct picture, large step sizes lead to errors that are very quickly larger than the solution values.
I have a problem that is equal parts trig and Python. I am plotting a cosine over time interval [0,t] whose frequency changes (slightly) according to another cosine function. So what I'd expect to see is a repeating pattern of higher-to-lower frequency that repeats over the duration of the window [0,t].
Instead what I'm seeing is that over time a low-freq motif emerges in the cosine plot and repeats over time, each time becoming lower and lower in freq until eventually the cosine doesn't even oscillate properly it just "wobbles", for lack of a better term.
I don't understand how this is emerging over the course of the window [0,t] because cosine is (obviously) periodic and the function modulating it is as well. So how can "new" behavior emerge?? The behavior should be identical across all periods of the modulatory cosine that tunes the freq of the base cosine, right?
As a note, I'm technically using a modified cosine, instead of cos(wt) I'm using e^(cos(wt)) [called von mises eq or something similar].
Minimum needed Code:
cos_plot = []
for wind,pos_theta in zip(window,pos_theta_vec): #window is vec of time increments
# for ref: DBFT(pos_theta) = (1/(2*np.pi))*np.cos(np.radians(pos_theta - base_pos))
f = float(baserate+DBFT(pos_theta)) # DBFT() returns a val [-0.15,0.15] periodically depending on val of pos_theta
cos_plot.append(np.exp(np.cos(f*2*np.pi*wind)))
plt.plot(cos_plot)
plt.show()
What you are observing could depend on "aliasing", i.e. the emergence of low-frequency figures because of sampling of an high frequency function with a step that is too big.
(picture taken from the linked Wikipedia page)
If the issue is NOT aliasing consider that any function shape between -1 and 1 can be obtained with cos(f(x)*x) by simply choosing f(x).
For, consider any function -1 <= g(x) <= 1 and set f(x) = arccos(g(x))/x.
To look for the problem try plotting your "frequency" and see if anything really strange is present in it. May be you've a bug in DBFT.
In the interest of posterity, in case anyone ever needs an answer to this question:
I wanted a cosine whose frequency was a time-varying function freq(t). My mistake was simply evaluating this function at each time t like this: Acos(2pifreq(t)t). Instead you need to integrate freq(t) from 0 to t at each time point: y = cos(2%piintegral(f(t)) + 2%pi*f0*t + phase). The term for this procedure is a frequency sweep or chirp (not identical terms, but similar if you need to google/SO answers).
Thanks to those who responded with help :)
SB
I am solving the homework-1 of Caltech Machine Learning Course (http://work.caltech.edu/homework/hw1.pdf) . To solve ques 7-10 we need to implement a PLA. This is my implementation in python:
import sys,math,random
w=[] # stores the weights
data=[] # stores the vector X(x1,x2,...)
output=[] # stores the output(y)
# returns 1 if dot product is more than 0
def sign_dot_product(x):
global w
dot=sum([w[i]*x[i] for i in xrange(len(w))])
if(dot>0):
return 1
else :
return -1
# checks if a point is misclassified
def is_misclassified(rand_p):
return (True if sign_dot_product(data[rand_p])!=output[rand_p] else False)
# loads data in the following format:
# x1 x2 ... y
# In the present case for d=2
# x1 x2 y
def load_data():
f=open("data.dat","r")
global w
for line in f:
data_tmp=([1]+[float(x) for x in line.split(" ")])
data.append(data_tmp[0:-1])
output.append(data_tmp[-1])
def train():
global w
w=[ random.uniform(-1,1) for i in xrange(len(data[0]))] # initializes w with random weights
iter=1
while True:
rand_p=random.randint(0,len(output)-1) # randomly picks a point
check=[0]*len(output) # check is a list. The ith location is 1 if the ith point is correctly classified
while not is_misclassified(rand_p):
check[rand_p]=1
rand_p=random.randint(0,len(output)-1)
if sum(check)==len(output):
print "All points successfully satisfied in ",iter-1," iterations"
print iter-1,w,data[rand_p]
return iter-1
sign=output[rand_p]
w=[w[i]+sign*data[rand_p][i] for i in xrange(len(w))] # changing weights
if iter>1000000:
print "greater than 1000"
print w
return 10000000
iter+=1
load_data()
def simulate():
#tot_iter=train()
tot_iter=sum([train() for x in xrange(100)])
print float(tot_iter)/100
simulate()
The problem according to the answer of question 7 it should take around 15 iterations for perceptron to converge when size of training set but the my implementation takes a average of 50000 iteration . The training data is to be randomly generated but I am generating data for simple lines such as x=4,y=2,..etc. Is this the reason why I am getting wrong answer or there is something else wrong. Sample of my training data(separable using y=2):
1 2.1 1
231 100 1
-232 1.9 -1
23 232 1
12 -23 -1
10000 1.9 -1
-1000 2.4 1
100 -100 -1
45 73 1
-34 1.5 -1
It is in the format x1 x2 output(y)
It is clear that you are doing a great job learning both Python and classification algorithms with your effort.
However, because of some of the stylistic inefficiencies with your code, it makes it difficult to help you and it creates a chance that part of the problem could be a miscommunication between you and the professor.
For example, does the professor wish for you to use the Perceptron in "online mode" or "offline mode"? In "online mode" you should move sequentially through the data point and you should not revisit any points. From the assignment's conjecture that it should require only 15 iterations to converge, I am curious if this implies the first 15 data points, in sequential order, would result in a classifier that linearly separates your data set.
By instead sampling randomly with replacement, you might be causing yourself to take much longer (although, depending on the distribution and size of the data sample, this is admittedly unlikely since you'd expect roughly that any 15 points would do about as well as the first 15).
The other issue is that after you detect a correctly classified point (cases when not is_misclassified evaluates to True) if you then witness a new random point that is misclassified, then your code will kick down into the larger section of the outer while loop, and then go back to the top where it will overwrite the check vector with all 0s.
This means that the only way your code will detect that it has correctly classified all the points is if the particular random sequence that it evaluates them (in the inner while loop) happens to be a string of all 1's except for the miraculous ability that on any particular 0, on that pass through the array, it classifies correctly.
I can't quite formalize why I think that will make the program take much longer, but it seems like your code is requiring a much stricter form of convergence, where it sort of has to learn everything all at once on one monolithic pass way late in the training stage after having been updated a bunch already.
One easy way to check if my intuition about this is crappy would be to move the line check=[0]*len(output) outside of the while loop all together and only initialize it one time.
Some general advice to make the code easier to manage:
Don't use global variables. Instead, let your function to load and prep the data return things.
There are a few places where you say, for example,
return (True if sign_dot_product(data[rand_p])!=output[rand_p] else False)
This kind of thing can be simplified to
return sign_dot_product(data[rand_p]) != output[rand_p]
which is easier to read and conveys what criteria you're trying to check for in a more direct manner.
I doubt efficiency plays an important role since this seems to be a pedagogical exercise, but there are a number of ways to refactor your use of list comprehensions that might be beneficial. And if possible, just use NumPy which has native array types. Witnessing how some of these operations have to be expressed with list operations is lamentable. Even if your professor doesn't want you to implement with NumPy because she or he is trying to teach you pure fundamentals, I say just ignore them and go learn NumPy. It will help you with jobs, internships, and practical skill with these kinds of manipulations in Python vastly more than fighting with the native data types to do something they were not designed for (array computing).
I've been struggling with an algorithm tied to comparisons with 3d triangle vectors. Unfortunately its very slow in places and I've gone back and forth on different methods to try and improve it. One thing I'm struggling with is speeding up a distance calculation.
I have two groups of triangles which have been broken down to three points each of which has a 3d float vector (xyz). The calculations I'm using are :
diffverts = numpy.zeros( ( ntris*3, ntesttris*3, 3 ), dtype = 'float32')
diffverts += triverts.reshape(ntris*3, 1, 3 )
diffverts -= ttriverts.reshape(1, ntesttris*3, 3 )
vertdist = ( diffverts[:,:,0]**2 + diffverts[:,:,1]**2 + diffverts[:,:,2]**2 ) ** 0.5
this calculation is faster than :
diffverts = triverts.reshape(ntris*3, 1, 3 ) - ttriverts.reshape(1, ntesttris*3, 3 )
vertdist = ( diffverts[:,:,0]**2 + diffverts[:,:,1]**2 + diffverts[:,:,2]**2 ) ** 0.5
Is there a faster method to populate the diff vert part (which takes longest) and/or the distance part which is also quite time consuming? This code is called a lot of times due to the number of groups to test. Also, trying to do it just on indexes to the verts causes me other issues with further calculations when trying to get back to some boolean tests (i.e. this is only one of a set of calculations so keeping at the tri point level works best.
I'm using numpy and python
The problem is that brute force testing of all triangles versus eachother takes quadratic time. It is better to use a datastructure which is specialized to perform such computations. Luckily, scipy contains one.
Take a look at scipy.spatial.cKDTree. The help should be self-explanatory.
I think diffverts is taking up enough memory to cause cache misses. Unfortunately while this solution is very elegant, you're probably better off computing the whole distance in one go, to avoid having to save an n*m*3 array of intermediate values. As ugly as it is, I would just do nested for loops.
I am looking for the most efficient way to randomly draw nelements in a list given a list of probabilities stating the probability of each element to be picked.
aList = [3,4,2,1,4,3,5,7,6,4]
MyProba = [0.1,0.1,0.2,0,0.1,0,0.2,0,0.2,0.1]
It means that at each draw, the first element (which is 3) has a probability of 0.1 to be drawn. Of course,
sum(MyProba) == 1 # always returns True
len(aList) == len(MyProba) # always returns True
Up to now I did the following:
def random_pick(some_list, proba):
x = random.uniform(0, 1)
cumulative_proba = 0.0
for item, item_proba in zip(some_list, proba):
cumulative_proba += item_proba
if x < cumulative_proba:
break
return item
nb_draws = 10
list_of_drawn_elements = []
for one_draw in range(nb_draws):
list_of_drawn_elements.append(random_pick(aList, MyProba))
It works but it is terribly slow for long lists and big values of nb_draws. How can I improve the speed of this process?
Note: In the special case I am facing, nb_draws always equals the length of aList.
The general idea (as outlined by others' answers as well) is that your method is inefficient because the preprocessing (the calculation of the cumulative distribution) is done every time you draw a sample, although it would be enough to do it once before the sampling and then use the preprocessed data to do the sampling.
The preprocessing and sampling could be done efficiently with Walker's alias method. I have implemented it a while ago; take a look at the source code. (Sorry for the external link, but I think it's too long to post it here). My version requires NumPy; if you don't want to use NumPy, there is a NumPy-free alternative as well (on which my version is based).
Edit: the explanation of Walker's alias method is to be found in the first link I provided. In a nutshell, imagine that you somehow managed to construct a rectangular "darts board" that is subdivided into parts such that each part corresponds to one of your original items, and the area of each part is proportional to the desired probability of selecting the corresponding element. You can then start throwing darts at random at the darts board (by generating two random numbers that specify the horizontal and vertical coordinate of where the dart ended up) and check which areas the darts hit. The items corresponding to the areas will be the items you have selected. Walker's alias method is simply a linear-time preprocessing that constructs the dart board. Drawing each element can then be done in constant time. In the end, drawing m elements out of n will have a cost of O(n) for preprocessing and O(m) for generating the samples, yielding a total complexity of O(n + m).
here's my lazy method... build a list with expected number of values for the desired distribution, and use random.choice() to pick a value from the list.
>>> import random
>>>
>>> value_probs = dict(zip([3,4,2,1,4,3,5,7,6,4], [0.1,0.1,0.2,0,0.1,0,0.2,0,0.2,0.1]))
>>> expected_dist = sum([[i] * int(prob * 100) for i, prob in value_probs.iteritems()], [])
>>> random.choice(expected_dist)
You might try to precalculate the cumulative probability range for each element and make a tree from these intervals. Then you will get a logarithmic complexity for looking up the element corresponding to the generated probability, instead of linear one that you have now.
You're calculating cumulative_proba each time when you call random_pick. I suggest to calculate it outside the method, and use a better data structure to store it, like a binary search tree, which will reduce the time complexity from O(n) to O(lgn).