How to deal with functions that approach infinity in NumPy? - python

In a book on matplotlib I found a plot of 1/sin(x) that looks similar to this one that I made:
I used the domain
input = np.mgrid[0 : 1000 : 200j]
What confuses me here to an extreme extent is the fact that the sine function is just periodic. I don't understand why the maximal absolute value is decreasing. Plotting the same function in wolfram-alpha does not show this decreasing effect. Using a different step-amount
input = np.mgrid[0 : 1000 : 300j]
delivers a different result:
where we also have this decreasing tendency in maximal absolute value.
So my questions are:
How can I make a plot like this consistent, i.e. independent of the step size / number of steps?
Why does one see this decreasing tendency even though the function is purely periodic?

The sine function is sampled far too coarsely here: its period (2π ≈ 6.28) is comparable to the sample spacing (1000/199 ≈ 5.03), so what you're seeing is aliasing between the sampling frequency and the true frequency. Since one of the roots is at 0, the small discrepancy between the first few samples and the nearest multiple of π grows linearly away from 0, producing a 1/x envelope.
In this example, input[5] is 5·(1000/(200−1)) ≈ 25.1256 = 8π − 0.007113, so the function is about 1/sin(−0.007113) ≈ −141 there, as shown. input[10] is likewise 16π − 0.014226, so the function is about −70, and so on, as long as the discrepancy stays much smaller than π.
It's possible for one of the quasi-periodic sample sequences to eventually land even closer to a multiple of π, producing a more complicated pattern like the one in the second plot.
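A quick numerical check of this (a minimal sketch; x below is just a stand-in name for the grid from the question):

import numpy as np

x = np.mgrid[0:1000:200j]          # the 200-sample grid from the question
k = np.arange(1, 6)

# every 5th sample lands close to a multiple of pi, with an offset that grows linearly
print(x[5 * k] - 8 * np.pi * k)    # about -0.0071, -0.0142, -0.0213, ...
print(1 / np.sin(x[5 * k]))        # about -141, -70, -47, ...: a 1/x envelope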

Why does one see this decreasing tendency even though the function is purely periodic?
Keep in mind that at every multiple of pi the function actually goes to infinity, and that the size of the displayed jump only reflects the largest of the sampled values at which the function still made sense. So you get a big jump if you happen to sample a value where the function is big, but not too big to be a float.
To be able to plot anything, matplotlib throws away values that do not make sense, like the np.nan you get at multiples of pi and the ±np.inf you get for values very close to that. I believe what happens is that one step size away from zero you happen to get a value small enough not to be thrown away but still very large, while when you get to pi and its multiples the largest value gets thrown away.
How can I make a plot like this consistent i.e. independent of step-size/step-amount?
You get strange behaviour around the values where your function becomes unreasonably large. Just pick a y-limit to avoid plotting those extremely large values.
import numpy as np
import matplotlib.pyplot as plt

# start just above 0 so we never divide by sin(0) = 0 exactly
x = np.linspace(1e-10, 50, 10**4)
plt.plot(x, 1 / np.sin(x))
plt.ylim(-15, 15)   # clip the huge values near multiples of pi
plt.show()
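An alternative, closer to the point about discarded values above, is to mask the near-singular samples yourself so that matplotlib leaves gaps there (a sketch, not part of the original answer):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(1e-10, 50, 10**4)
y = 1 / np.sin(x)
y[np.abs(y) > 15] = np.nan    # matplotlib skips NaN points, leaving a gap at each pole
plt.plot(x, y)
plt.show()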

Related

How can I "untruncate" series data with Pandas?

I have some time series data that's effectively been recorded at a truncated precision relative to the rate it changes. This leads to a stairstep-like quality when I graph it. I'm using Pandas to manipulate and store the data. Is there a way I can use Pandas to smooth out the stairsteps, inferring extra precision from the time series data?
In more detail, here's a sample graph:
The green line represents recorded temperatures. My temperature sensor is only accurate to a tenth of a degree Celsius, but the rate of temperature change is significantly less than a tenth of a degree every recording interval.
I think it should be possible to infer extra precision based on how quickly the values are changing, but I'm not sure what the best way to do that is. I got an okay-ish looking result by using pandas.rolling_mean, but that uses a fixed window for the mean, even though different parts of the graph would benefit from different window sizes. It also shortens narrower peaks because of the relatively-wide window.
Ideally I'd like to get something continuous enough that I can take a derivative of the data and not have tremendously spiky results.
So what, if anything, in Pandas can help me get the results I'm looking for?
Well, here's what I've done so far. I get okay results from this, though I still think there should be something that feels less like a hack.
If we assume that the temperature sensor is reasonably accurate to a hundredth of a degree but is only reporting tenths of a degree, we can infer that when a reading changes from, say 0.1 to 0.2 the actual reading at that time is around 0.15. So I scan through the series, find all of the places where the value changes, set the value to the mean of the two, and set all other values to NaN. Then I use Series.interpolate to construct pleasant-looking curves that are, for the most part, at least as accurate as the original readings.
Here's the code:
import numpy as np

def smooth_data(data, method='linear'):
    data = data.copy().astype(np.float64)
    # walk over consecutive pairs of index labels (the end points stay untouched)
    for i0, i1 in zip(data.index[1:], data.index[2:]):
        if data[i0] == data[i1]:
            # value did not change: drop it so interpolate() can fill it in
            data[i0] = np.nan
        else:
            # value changed: the true reading was probably about halfway between
            data[i0] = (data[i0] + data[i1]) / 2
    return data.interpolate(method=method)
Here's what the same data looks like with this smoothing (and cubic interpolation):
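For reference, a call with cubic interpolation might look like this, using smooth_data from above on a made-up series (method='cubic' requires scipy):

import pandas as pd

# hypothetical stand-in for the truncated sensor readings
temps = pd.Series([20.1, 20.1, 20.2, 20.2, 20.2, 20.3, 20.3], dtype=float)
print(smooth_data(temps, method='cubic'))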

How do different methods of getting spectra in Python actually work?

I have several signals as columns in a pandas dataframe (each of them has some NaNs at the beginning or the end, because they all cover slightly different intervals). The signal has some sort of a trend (basically a long-wavelength portion with values on the order of X00) and some small wiggles (values on the order of X to X0). I would like to compute a spectrum of each of these columns, and I expect to see two peaks in it: one in the X-X0 range and the other at X00 (which suggests that I should work on a log scale).
However, the "spectra" that I produced using several different methods (scipy.signal.welch and numpy.fft.fft) do not look like the expected output. (peaks always at 20 and 40).
Here are several aspects that I don't understand:
Is there any time-series processing built in somewhere deep in these functions, so that they don't work if I use wavelengths instead of periods/frequencies?
I found the documentation rather confusing and not very helpful, in particular when it comes to the input parameters and the output. Do I include the signal as it is, or do I need to do some sort of pre-processing first? Should I then pass the sampling frequency (i.e. 1/(sampling interval), which in my case would be, say, 1/0.01 per metre) or the sampling interval itself (i.e. 0.01 m)? Does the output use the same unit, or 1/the unit? (I tried all combinations of unit and 1/unit and none of them yielded a reasonable result, so there is another problem here too, but I am still uncertain about this.)
Should I use yet another method/are these not suitable?
Not even sure if these are the right questions to ask, but I am afraid if I knew what question to ask, I would know the answer.
Disclaimer: I am not proficient at signal processing, so I am not actually sure if my issue is really with python or in deeper understanding of the problem.
Even if I try a very simple example, I don't understand the behaviour:
import numpy as np
import scipy as sp
import scipy.signal   # needed so that sp.signal resolves

x = np.arange(0, 10, 0.01) * np.pi
y = np.sin(0.2 * x) + np.cos(3 * x)
freq, spec = sp.signal.welch(y, fs=1 / (0.01 * np.pi))
I would expect to see two peaks in the spectrum, one at ~15 and another at ~2; or, if it is still in frequency, then at ~1/15 and ~1/2. But this is what I get in the first case, and this is what I get if I plot 1/freq instead of freq: the 15 is even out of range! So I don't know what I am actually plotting.
Thanks a lot.
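For what it's worth, a minimal sketch of how scipy.signal.welch relates fs, the returned frequencies and wavelengths, on a made-up profile (dx, x and y below are hypothetical stand-ins, not the data from the question):

import numpy as np
from scipy import signal

dx = 0.01                        # sampling interval in metres (made-up value)
x = np.arange(0, 150, dx)        # a 150 m long synthetic profile
y = 3 * np.sin(2 * np.pi * x / 15) + np.sin(2 * np.pi * x / 2)   # wavelengths 15 m and 2 m

# fs is the sampling *frequency* in samples per metre, i.e. 1/dx, not the interval dx
freq, psd = signal.welch(y, fs=1 / dx, nperseg=3000)

# freq comes back in cycles per metre, so wavelength is its reciprocal
peak = np.argmax(psd[1:]) + 1    # skip the zero-frequency bin
print(1 / freq[peak])            # ~15, the dominant wavelength in metres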

Find best interpolation nodes

I am studying physics and ran into a really interesting problem. I'm not an expert at programming, so please take this into account while reading.
I really hope that someone can help me with this problem, because I have been struggling with it for about two months now and haven't seen any success.
So here is my Problem:
I have a bunch of data sets (more than 2, fewer than 20) from numerical calculations. Each set is given as x against measurement values. I have a set of sensors and want to find the best positions x for them, such that the integral of the interpolation through the sensor readings comes as close as possible to the integral of the numerical data set.
As this sounds like a typical mathematical problem I started to look for some theorems but I did not find anything.
So I started to write a python program based on the SLSQP minimizer. I chose this because it can handle bounds and constraints. (Note there is always a sensor at 0 and one at 1)
Constraints: the sensor array must stay sorted at all times, i.e. x_i < x_{i+1}, and the interval of x is normalized to [0, 1].
Before doing an overall optimization I started to look for good starting points and searched for maxima, minima and linear regions of my given data sets. But an optimization over 40 values turned out to deliver bad results.
In my second try I searched for these points and defined certain areas. I then optimized each area with 1 to 40 sensors, compared the results, and decided which areas were worth putting more sensors in. In the last step I wanted to do an overall optimization again. But this idea did not turn out to be the proper solution either, because the optimization had convergence problems as well.
The big problem was that my optimizer broke the boundaries. I handled this by interrupting the optimization, because once the boundaries were broken the final result was not correct. When this happens I reset my initial setup to a homogeneous distribution. After that there are normally no more boundary violations, but the result tends to be a homogeneous distribution too, and often that is obviously not the optimal distribution.
As my algorithm works for simple examples and dies for more complex data, I think there is a general problem and not just some error in my coding. Does anyone have an idea how to move on, or know of some theory on this matter?
The attached plot shows the areas in different colors. The function is shown at the bottom and the sensor positions are represented as dots: dots at y=1 are from the optimization with one sensor, 2 represents the results of the optimization with 2 variables, and so on. As the program reaches higher sensor numbers the whole thing gets more and more homogeneous.
It is easy to see that if n is the number of sensors and n goes to infinity, you end up with a totally homogeneous distribution. But as far as I can see, that should not already happen for just 10 sensors.
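For concreteness, here is a minimal sketch of the kind of setup described above; the data set, the integral-mismatch objective and the ordering constraints are made-up stand-ins, not the questioner's actual code:

import numpy as np
from scipy.optimize import minimize

def trapz(y, x):
    # simple trapezoidal integral, to keep the sketch self-contained
    return np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0

# made-up "numerical data set" on the normalized interval [0, 1]
x_data = np.linspace(0.0, 1.0, 401)
y_data = np.sin(2 * np.pi * x_data) ** 2 + x_data
target = trapz(y_data, x_data)            # integral of the full data set

n_inner = 8                               # free sensors between the fixed ones at 0 and 1

def objective(inner):
    xs = np.concatenate(([0.0], inner, [1.0]))
    ys = np.interp(xs, x_data, y_data)    # "measurements" at the sensor positions
    # squared mismatch between the integral of the linear interpolation and the target
    return (trapz(ys, xs) - target) ** 2

x0 = np.linspace(0.0, 1.0, n_inner + 2)[1:-1]   # homogeneous initial guess
# ordering constraints x_i <= x_{i+1}, expressed as x_{i+1} - x_i >= 0
cons = [{'type': 'ineq', 'fun': lambda v, i=i: v[i + 1] - v[i]} for i in range(n_inner - 1)]
res = minimize(objective, x0, method='SLSQP',
               bounds=[(0.0, 1.0)] * n_inner, constraints=cons)
print(np.sort(res.x), objective(res.x))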

Scipy.optimize.minimize only iterates some variables.

I have written python (2.7.3) code wherein I aim to create a weighted sum of 16 data sets, and compare the result to some expected value. My problem is to find the weighting coefficients which will produce the best fit to the model. To do this, I have been experimenting with scipy's optimize.minimize routines, but have had mixed results.
Each of my individual data sets is stored as a 15x15 ndarray, so their weighted sum is also a 15x15 array. I define my own 'model' of what the sum should look like (also a 15x15 array), and quantify the goodness of fit between my result and the model using a basic least squares calculation.
R=np.sum(np.abs(model/np.max(model)-myresult)**2)
'myresult' is produced as a function of some set of parameters 'wts'. I want to find the set of parameters 'wts' which will minimise R.
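For concreteness, a self-contained stand-in for this structure (random arrays in place of the real data sets and model, and a hypothetical residual function) might look like:

import numpy as np

# made-up stand-ins: 16 data sets, each a 15x15 array, plus a 15x15 model
datasets = np.random.rand(16, 15, 15)
model = np.random.rand(15, 15)

def residual(wts):
    # weighted sum of the 16 arrays -> one 15x15 array
    myresult = np.tensordot(wts, datasets, axes=1)
    return np.sum(np.abs(model / np.max(model) - myresult) ** 2)

print(residual(np.ones(16) / 16))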
To do so, I have been trying this:
res = minimize(get_best_weightings,wts,bounds=bnds,method='SLSQP',options={'disp':True,'eps':100})
Where my objective function is:
import numpy as np

def get_best_weightings(wts):
    # first 16 entries are the real parts of the weights, the last 16 the imaginary parts
    wts_tr = wts[0:16]
    wts_ti = wts[16:32]
    # portlist, originalwtsr, originalwtsi, modelbeam and make_weighted_beam
    # are defined elsewhere in the (unposted) code
    for i, j in enumerate(portlist):
        originalwtsr[j] = wts_tr[i]
        originalwtsi[j] = wts_ti[i]
    realwts = originalwtsr
    imagwts = originalwtsi
    myresult = make_weighted_beam(realwts, imagwts, 1)
    R = np.sum(np.abs(modelbeam / np.max(modelbeam) - myresult) ** 2)
    return R
The input (wts) is an ndarray of shape (32,), and the output, R, is just some scalar, which should get smaller as my fit gets better. By my understanding, this is exactly the sort of problem ("Minimization of scalar function of one or more variables.") which scipy.optimize.minimize is designed to optimize (http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.optimize.minimize.html ).
However, when I run the code, although the optimization routine seems to iterate over different values of all the elements of wts, only a few of them seem to 'stick'; i.e., all but four of the values are returned unchanged from my initial guess. To illustrate, I plot the values of my initial guess for wts (in blue) and the optimized values (in red). You can see that for most elements the two lines overlap.
Image:
http://imgur.com/p1hQuz7
Changing just these few parameters is not enough to get a good answer, and I can't understand why the other parameters aren't also being optimised. I suspect that maybe I'm not understanding the nature of my minimization problem, so I'm hoping someone here can point out where I'm going wrong.
I have experimented with a variety of minimize's built-in methods (I am by no means committed to SLSQP, or certain that it's the most appropriate choice), and with a variety of 'step sizes' eps. The bounds I am using for my parameters are all (-4000, 4000). I only have scipy version 0.11, so I haven't tested a basinhopping routine to get the global minimum (that needs 0.12). I have looked at scipy.optimize.brute, but haven't tried implementing it yet; I thought I'd check whether anyone can steer me in a better direction first.
Any advice appreciated! Sorry for the wall of text and the possibly (probably?) idiotic question. I can post more of my code if necessary, but it's pretty long and unpolished.

mlpy - Dynamic Time Warping depends on x?

I am trying to get the distance between these two arrays shown below by DTW.
I am using the Python mlpy package that offers
dist, cost, path = mlpy.dtw_std(y1, y2, dist_only=False)
I understand that DTW does take care of the "shifting". In addition, as can be seen from above, mlpy.dtw_std() only takes in two 1-D arrays. So I expect that no matter how I shift my curves left or right, the dist returned by the function should never change.
However after shifting my green curve a bit to the right, the dist returned by mlpy.dtw_std() changes!
Before shifting: Python mlpy.dtw_std reports dist = 14.014
After shifting: Python mlpy.dtw_std reports dist = 38.078
Obviously, since the curves are still those two curves, I don't expect the distances to be different!
Why is this so? Where did I go wrong?
Let me reiterate what I have understood; please correct me if I am going wrong anywhere. I observe that in both your plots the blue 1-D series stays identical, while the green one is getting stretched (how you are doing that, you explained in your post of Sep 19 '13 at 9:36). Your premise is that because (1) DTW 'takes care' of time shift and (2) all you are doing is stretching one time series lengthwise without affecting the y-values, (inference:) you expect the distance to remain the same.
There is a missing link between [(1), (2)] and [(inference)]: the individual distance values corresponding to the mappings WILL change as you change the set of signals itself, and this results in a different overall distance. Plot the warping paths and the cost grid to see it for yourself.
Let's take an oversimplified case...
Let
a=range(0,101,5) = [0,5,10,15...95, 100]
and b=range(0,101,5) = [0,5,10,15...95, 100].
Now, intuitively speaking, one would expect a one-to-one correspondence between the two signals (for the DTW mapping), and the distance for every mapping to be 0, since the signals look identical.
Now if we make, b=range(0,101,4) = [0,4,8,12...96,100],
DTW mapping between a and b would still start with a's 0 mapped to b's 0 and end at a's 100 mapped to b's 100 (boundary constraints). Also, because DTW 'takes care' of time shift, I would expect the 20s, 40s, 60s and 80s of the two signals to be mapped to one another. (I haven't actually DTW'd these two myself; I'm saying this from intuition, so please check. Non-intuitive warpings are possible too, depending on the step patterns allowed and the global constraints, but let's stick with the intuitive warping for the sake of simplicity.)
For the remaining data points, the distances corresponding to the mappings are clearly non-zero, and therefore the overall distance is non-zero too. The overall cost has changed from zero to something non-zero.
Now, this was the case when our signals were too simplistic, linearly increasing. Imagine the variabilities that will come into picture when you have real life non-monotonous signals, and need to find time-warping between them. :)
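To see this concretely, here is a small pure-NumPy DTW sketch (not mlpy) of the a and b example above:

import numpy as np

def dtw_distance(s, t):
    # classic dynamic-programming DTW with absolute-difference local cost
    n, m = len(s), len(t)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(s[i - 1] - t[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

a = np.arange(0, 101, 5)   # [0, 5, ..., 100], 21 points
b = np.arange(0, 101, 4)   # [0, 4, ..., 100], 26 points

print(dtw_distance(a, a))  # 0.0 - identical signals map one to one
print(dtw_distance(a, b))  # > 0 - same shape, different sampling, non-zero cost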
(PS: Please don't forget to upvote answer :D). Thanks.
Obviously, the curves are not identical, and therefore the distance function must not be 0 (otherwise it would not be a distance, by definition).
What IS "relatively large"? The distance probably is not infinite, is it?
140 points in time, each contributing a small delta, still add up to a non-zero number.
The distance from New York to Beijing is roughly 11018 km. Or 11018000000 mm.
The distance to Alpha Centauri is small, just 4.34 light-years. That is the nearest other stellar system to us...
Compare with the distance to a non-similar series; that distance should be much larger.
