How do different methods of getting spectra in Python actually work?

I have several signals as columns in a pandas DataFrame (each of them has some NaNs at the beginning or the end, because they all cover slightly different intervals). Each signal has some sort of trend (basically a long-wavelength component with values on the order of X00) and some small wiggles (values on the order of X to X0). I would like to compute a spectrum of each of these columns, and I expect to see two peaks in it: one in the X to X0 range and the other around X00 (which suggests that I should work on a log scale).
However, the "spectra" that I produced using several different methods (scipy.signal.welch and numpy.fft.fft) do not look like the expected output (the peaks are always at 20 and 40).
Here are several aspects that I don't understand:
Is there any time-series processing built in somewhere deep inside these functions, so that they actually don't work if I use wavelengths instead of periods/frequencies?
I found the documentation rather confusing and not very helpful, in particular when it comes to the input parameters and the output. Do I pass in the signal as it is, or do I need to do some sort of pre-processing first? Should I then supply the sampling frequency (i.e. 1/(wavelength sampling interval), which in my case would be, say, 1/0.01 per m) or the sampling interval (i.e. 0.01 m)? Does the output come back in the same unit or in 1/the unit? (I tried all combinations of unit and 1/unit and none of them yielded a reasonable result, so there is another problem here too, but I am still uncertain about this.)
Should I use yet another method, or are these not suitable?
I am not even sure these are the right questions to ask, but I am afraid that if I knew what question to ask, I would already know the answer.
Disclaimer: I am not proficient at signal processing, so I am not actually sure whether my issue is really with Python or with a deeper understanding of the problem.
Even if I try a very simple example, I don't understand the behaviour:
import numpy as np
import scipy as sp
import scipy.signal

x = np.arange(0, 10, 0.01)
x = x * np.pi
y = np.sin(0.2*x) + np.cos(3*x)
freq, spec = sp.signal.welch(y, fs=1/(0.01*np.pi))
I would expect to see two peaks in the spectrum, one at ~15 and another one at ~2. Or, if it is still in frequency, then at ~1/15 and ~1/2. But this is what I get in the first case, and this is what I get if I plot 1/freq instead of freq: the 15 is even out of range! So I don't know what I am actually plotting.
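For what it's worth, here is a self-contained version of that toy example with the frequency-to-wavelength conversion spelled out; the expected peak positions in the comments are only my assumption, based on the textbook relation that sin(k*x) has frequency k/(2*pi) and wavelength 2*pi/k:

import numpy as np
from scipy import signal

x = np.arange(0, 10, 0.01) * np.pi
y = np.sin(0.2*x) + np.cos(3*x)

# fs is in samples per x-unit, so freq comes back in cycles per x-unit.
freq, spec = signal.welch(y, fs=1/(0.01*np.pi), nperseg=len(y))

peak_freqs = freq[np.argsort(spec)[-2:]]   # the two strongest bins
print(peak_freqs)       # expected near 0.2/(2*pi) ~ 0.032 and 3/(2*pi) ~ 0.48
print(1 / peak_freqs)   # as wavelengths: ~ 31.4 and ~ 2.1 x-units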
Thanks a lot.

Related

How to deal with functions that approach infinity in NumPy?

In a book on matplotlib I found a plot of 1/sin(x) that looks similar to this one that I made:
I used the domain
input = np.mgrid[0 : 1000 : 200j]
What confuses me here to an extreme extent is the fact that the sine function is simply periodic. I don't understand why the maximal absolute value is decreasing. Plotting the same function in Wolfram Alpha does not show this decreasing effect. Using a different step-amount
input = np.mgrid[0 : 1000 : 300j]
delivers a different result:
where we also have this decreasing tendency in maximal absolute value.
So my questions are:
How can I make a plot like this consistent i.e. independent of step-size/step-amount?
Why does one see this decreasing tendency even though the function is purely periodic?
The sample spacing here (1000/(200−1) ≈ 5.03) is larger than half the period of the sine function (π ≈ 3.14), so the signal is undersampled, and what you're seeing is aliasing driven by the difference between the sampling frequency and a multiple of the true frequency. Since one of the roots is at 0, the discrepancy between the first few samples and the nearest multiple of π grows linearly away from 0, producing a 1/x envelope.
In this example, input[5] is 5·(1000/(200−1)) = 8π − 0.007113, so the function is about −141 there, as shown. input[10] is of course 16π − 0.014226, so the function is about −70, and so on as long as the discrepancy stays much smaller than π.
It's possible for one of the quasi-periodic sample sequences to eventually land even closer to a multiple of π, producing a more complicated pattern like the one in the second plot.
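A quick numeric check of those figures (not part of the original argument, just reproducing the quoted values):

import numpy as np

x = np.mgrid[0:1000:200j]      # same sampling as in the question
print(x[5] - 8*np.pi)          # ~ -0.00711
print(1 / np.sin(x[5]))        # ~ -141
print(x[10] - 16*np.pi)        # ~ -0.0142
print(1 / np.sin(x[10]))       # ~ -70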
Why does one see this decreasing tendency even though the function is purely periodic?
Keep in mind that at every multiple of π the function actually goes to infinity, and the size of the jump displayed only reflects the largest sampled value at which the function still made sense. You therefore get a big jump if you happen to sample a point where the function is large, but not too large to be represented as a float.
To be able to plot anything, matplotlib throws away values that do not make sense, like the np.nan you get at multiples of π and the ±np.inf you get for values very close to them. I believe what happens is that one step size away from zero you happen to get a value small enough not to be thrown away but still very large, while around π and its multiples the largest value gets thrown away.
How can I make a plot like this consistent i.e. independent of step-size/step-amount?
You get strange behaviour around the values where your function becomes unreasonably large. Just pick a y-limit to avoid plotting those crazy large values.
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(10**-10, 50, 10**4)
plt.plot(x, 1/np.sin(x))
plt.ylim((-15, 15))

How to programmatically report reasonable local maxima and minima in a data set?

I have an array of y-values which are evenly spaced along the x-axis and I need to programmatically find the "troughs". I think either Octave or Python3 are good language choices for this problem as I know both have strong math capabilities.
I thought about interpolating the function and looking for where the derivatives are 0, but that would require my human eyes to first analyze the resulting graph to know where the maxima and minima already are, and I need this entire thing to be automatic so that it works with an arbitrary dataset.
It dawned on me that this problem likely has an existing solution in a Python 3 or Octave function or library, but I could not find one. Does there exist a library to automatically report local maxima and minima within a dataset?
More Info
My current planned approach is to implement a sort of "n-day moving average" with a threshold. After initializing the first day moving average, I'll watch for the next moving average to move above or below it by a threshold. If it moves higher then I'll consider myself in a "rising" period. If it moves lower then I'm in a "falling" period. While I'm in a rising period, I'll update the maximum observed moving average until the current moving average is sufficiently below the previous maximum.
At this point, I'll consider myself in a "falling" period. I'll lock in the point where the moving average was previously highest, and then repeat except using inverse logic for the "falling" period.
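A rough sketch of that idea (the window size and threshold here are hypothetical placeholders, indices refer to the smoothed series, and only troughs are recorded):

import numpy as np

def find_troughs(y, window=5, threshold=1.0):
    # Smooth with a simple moving average, then track the running extreme
    # and lock it in once the average has moved away by `threshold`.
    ma = np.convolve(y, np.ones(window) / window, mode='valid')
    troughs, state, extreme = [], 'falling', 0
    for i in range(1, len(ma)):
        if state == 'falling':
            if ma[i] < ma[extreme]:
                extreme = i                      # new lowest point so far
            elif ma[i] > ma[extreme] + threshold:
                troughs.append(extreme)          # lock in the trough
                state, extreme = 'rising', i
        else:
            if ma[i] > ma[extreme]:
                extreme = i                      # new highest point so far
            elif ma[i] < ma[extreme] - threshold:
                state, extreme = 'falling', i
    return troughs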
It seemed to me that this is probably a pretty common problem though, so I'm sure there's an existing solution.
Python answer:
This is a common problem, with existing solutions.
Examples include:
peakutils
scipy find_peaks
see also this question
In all cases, you'll have to tune the parameters to get what you want.
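For the scipy route, a minimal sketch of finding troughs (the prominence value here is just an example of the kind of parameter you will need to tune):

import numpy as np
from scipy.signal import find_peaks

y = np.array([3.0, 2.1, 1.0, 1.8, 2.5, 1.2, 0.4, 1.1, 2.0])

# Troughs of y are peaks of -y.
troughs, props = find_peaks(-y, prominence=0.5)
print(troughs)   # -> [2 6], the indices of the two local minima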
Octave answer:
I believe immaximas and imregionalmax do exactly what you are looking for (depending on which of the two it is exactly that you are looking for - have a look at their documentation to see the difference).
These are part of the image package, but will obviously work on 1D signals too.
For more 'functional' zero-finding functions, there is also fzero etc.

Is there an alternative algorithm to do the following task in Python? (Astronomy pipeline)

First, a bit of background. This is a problem we are facing while building the software pipeline for a newly launched spacecraft.
The telescopes on board are looking at a specific target. However, as you might expect, the telescope is not perfectly stable and wobbles slightly. Hence at different time instants it is looking at SLIGHTLY different portions of the sky.
To correct for this we have a lab-made template (basically a 2-D array of zeros and ones) that tells us which portion of the sky is being looked at at a specific time instant (let's say t). It looks like this.
Here the white portion signifies the part of the telescope that is actually observing. This array is actually 2400x2400 (for accuracy; it can't be reduced because that would cause loss of information, and it is not really an array of 0s and 1s but of real numbers, because of other effects). Now, knowing the wobbles of the telescope, we also know that this template will wobble by the same amount. Hence we need to shift the array (using np.roll) in the x or y direction (and sometimes even rotate it, if the spacecraft is rotating) accordingly and accumulate, so that we know which portion of the sky has been observed for how long. However, this process is EXTREMELY time consuming and lengthy (even with the numpy implementations of add and roll). Moreover, we need to do this in the pipeline at least 500 times a second. Is there a way to avoid it? We are looking for an algorithmic solution, maybe a fundamentally new way of approaching the whole problem. Any help is welcome. Also, if any part is unclear, let me know; I will happily explain it further.
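For concreteness, a rough sketch of the shift-and-accumulate loop we currently do (the template values and the per-instant shift list here are made up for illustration, and rotation is omitted):

import numpy as np

template = np.random.rand(2400, 2400)   # lab-made template (real-valued)
shifts = [(1, 0), (0, 2), (-1, 1)]      # known (dy, dx) wobble per time instant
exposure = np.zeros_like(template)

for dy, dx in shifts:
    # shift the template by the wobble and accumulate the exposure map
    exposure += np.roll(np.roll(template, dy, axis=0), dx, axis=1)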
A previous question related to the same topic:
Click Here
We are implementing the pipeline in Python (probably a bad choice, I know).
If you want to use the shifted array contents for some calculation (applying a mask, etc.), you don't need to move the data physically; just use a modified index scheme to address the same elements.
For example, to virtually shift array by dx to the right, use in calculations
A[y][x-dx] instead of A[y][x]
This method becomes somewhat more complex when rotation takes place, but it is still solvable (one should compare the time for a real array rotation against the time for the coordinate recalculation).
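A small sketch of the index relation for the wrap-around shift that np.roll performs; the real saving comes from folding the remapped index into the downstream calculation instead of building any shifted array at all:

import numpy as np

A = np.arange(12).reshape(3, 4)
dx = 2

# Physically rolled array...
rolled = np.roll(A, dx, axis=1)

# ...versus addressing the original array with a remapped column index.
cols = (np.arange(A.shape[1]) - dx) % A.shape[1]
virtual = A[:, cols]

print(np.array_equal(rolled, virtual))   # True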

scipy.optimize.minimize only changes some of the variables.

I have written python (2.7.3) code wherein I aim to create a weighted sum of 16 data sets, and compare the result to some expected value. My problem is to find the weighting coefficients which will produce the best fit to the model. To do this, I have been experimenting with scipy's optimize.minimize routines, but have had mixed results.
Each of my individual data sets is stored as a 15x15 ndarray, so their weighted sum is also a 15x15 array. I define my own 'model' of what the sum should look like (also a 15x15 array), and quantify the goodness of fit between my result and the model using a basic least squares calculation.
R=np.sum(np.abs(model/np.max(model)-myresult)**2)
'myresult' is produced as a function of some set of parameters 'wts'. I want to find the set of parameters 'wts' which will minimise R.
To do so, I have been trying this:
res = minimize(get_best_weightings,wts,bounds=bnds,method='SLSQP',options={'disp':True,'eps':100})
Where my objective function is:
def get_best_weightings(wts):
    wts_tr = wts[0:16]
    wts_ti = wts[16:32]
    for i, j in enumerate(portlist):
        originalwtsr[j] = wts_tr[i]
        originalwtsi[j] = wts_ti[i]
    realwts = originalwtsr
    imagwts = originalwtsi
    myresult = make_weighted_beam(realwts, imagwts, 1)
    R = np.sum((np.abs(modelbeam/np.max(modelbeam) - myresult))**2)
    return R
The input (wts) is an ndarray of shape (32,), and the output, R, is just some scalar, which should get smaller as my fit gets better. By my understanding, this is exactly the sort of problem ("Minimization of scalar function of one or more variables.") which scipy.optimize.minimize is designed to optimize (http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.optimize.minimize.html ).
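As a sanity check of the call pattern, here is a toy objective of my own (not the beam-fitting problem; note that the eps used here is a small finite-difference step rather than the 100 I used above):

import numpy as np
from scipy.optimize import minimize

target = np.arange(1, 33, dtype=float)   # made-up optimum at wts = 1..32

def toy_objective(wts):
    return np.sum((wts - target)**2)

wts0 = np.zeros(32)
bnds = [(-4000, 4000)] * 32
res = minimize(toy_objective, wts0, bounds=bnds, method='SLSQP',
               options={'disp': True, 'eps': 1e-8})
print(res.x)   # should come back close to 1, 2, ..., 32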
However, when I run the code, although the optimization routine seems to iterate over different values of all the elements of wts, only a few of them seem to 'stick'. That is, all but four of the values are returned unchanged from my initial guess. To illustrate, I plot the values of my initial guess for wts (in blue) and the optimized values (in red). You can see that for most elements the two lines overlap.
Image:
http://imgur.com/p1hQuz7
Changing just these few parameters is not enough to get a good answer, and I can't understand why the other parameters aren't also being optimised. I suspect that maybe I'm not understanding the nature of my minimization problem, so I'm hoping someone here can point out where I'm going wrong.
I have experimented with a variety of minimize's built-in methods (I am by no means committed to SLSQP, or certain that it's the most appropriate choice), and with a variety of 'step sizes' eps. The bounds I am using for my parameters are all (-4000, 4000). I only have scipy version 0.11, so I haven't tested a basinhopping routine to get the global minimum (this needs 0.12). I have looked at scipy.optimize.brute, but haven't tried implementing it yet; I thought I'd check if anyone can steer me in a better direction first.
Any advice appreciated! Sorry for the wall of text and the possibly (probably?) idiotic question. I can post more of my code if necessary, but it's pretty long and unpolished.

How to generate all possible combinations of a 14x10 matrix containing only 1's and 0's

I'm working on a problem and one solution would require as input every possible 14x10 matrix made up of 1's and 0's... how can I generate these so that I can feed every possible 14x10 matrix into another function? Thank you!
Added March 21: It looks like I didn't word my post appropriately. Sorry. What I'm trying to do is optimize the output of 10 different production units (given different speeds and amounts of downtime) for several scenarios. My goal is to place blocks of downtime so as to minimize the differences in production on a day-to-day basis. The amount and frequency of downtime each unit is allowed is given. I am currently trying to evaluate a three-week cycle, meaning every three weeks each production unit is taken down for a given number of hours. I was asking the computer to determine the order in which the units would be taken down, based on the constraints that the lines come down only once every 3 weeks and that the difference in daily production is the smallest possible. My first approach was to use Excel (as I tried to describe above) and it didn't work (no surprise there), where 1 = running, 0 = off, and these are summed to calculate production. The calculated production is subtracted from a set maximum daily production. Then these differences were compared going from Mon-Tues, Tues-Wed, etc. for a three-week time frame and minimized using Solver. My next approach was to write a Matlab code where the input was a tolerance (the allowed day-to-day variation). Is there a program that already does this, or an approach that does this most easily? It seems simple enough, but I'm still thinking through the different ways to go about it. Any insight would be much appreciated.
The actual implementation depends heavily on how you want to represent matrices… But assuming the matrix can be represented by a 14 * 10 = 140 element list:
from itertools import product
for matrix in product([0, 1], repeat=140):
    # ... do stuff with the matrix ...
    pass
Of course, as other posters have noted, this probably isn't what you want to do… But if it really is what you want to do, that's the best code (given your requirements) to do it.
Generating every possible 14x10 matrix of 1's and 0's would produce 2**140 matrices. I don't believe you have enough lifetime for that; I'm not sure the sun will still be shining by the time you finish. That is why it is impossible to generate all those matrices. This amounts to brute force, so you must look for some other solution.
This is absolutely impossible! The number of possible matrices is 2^140, which is around 1.4e42. However, consider the following...
If you were to generate two 14-by-10 matrices at random, the odds that they would be the same are 1 in 1.4e42.
If you were to generate 1 billion unique 14-by-10 matrices, then the odds that the next one you generate would be the same as one of those would still be exceedingly slim: 1 in 1.4e33.
The default random number stream in MATLAB uses a Mersenne twister algorithm that has a period of 2^19937-1. Therefore, the random number generator shouldn't start repeating itself any time this eon.
Your approach should be thus:
Find a computer no one ever wants to use again.
Give it as much storage space as possible to save your results.
Install MATLAB on it and fire it up.
Start computing matrices at random like so:
while true
    newMatrix = randi([0 1],14,10);
    %# Process the matrix and output your results to disk
end
Walk away
Since there are so many combinations, you don't have to compare newMatrix with any of the previous matrices since the length of time before a repeat is likely to occur is astronomically large. Your processing is more likely to stop due to other reasons first, such as (in order of likely occurrence):
You run out of disk space to store your results.
There's a power outage.
Your computer suffers a fatal hardware failure.
You pass away.
The Earth passes away.
The Universe dies a slow heat death.
NOTE: Although I injected some humor into the above answer, I think I have illustrated one useful alternative. If you simply want to sample a small subset of the possible combinations (where even 1 billion could be considered "small" due to the sheer number of combinations) then you don't have to go through the extra time- and memory-consuming steps of saving all of the matrices you've already processed and comparing new ones to it to make sure you aren't repeating matrices. Since the odds of repeating a combination are so low, you could safely do this:
for iLoop = 1:whateverBigNumberYouWant
    newMatrix = randi([0 1],14,10);  %# Generate a new matrix
    %# Process the matrix and save your results
end
Are you sure you want every possible 14x10 matrix? There are 140 elements in each matrix, and each element can be on or off. Therefore there are 2^140 possible matrices. I suggest you reconsider what you really want.
Edit: I noticed you mentioned in a comment that you are trying to minimize something. There is an entire mathematical field called optimization devoted to doing this type of thing. The reason this field exists is because quite often it is not possible to exhaustively examine every solution in anything resembling a reasonable amount of time.
Trying this:
import numpy
for i in xrange(int(1e9)): a = numpy.random.random_integers(0,1,(14,10))
(which is much, much, much smaller than what you require) should be enough to convince you that this is not feasible. It also shows you how to generate one, or a few, such random matrices (generating even up to a million of them is pretty fast).
EDIT: changed to xrange to "improve speed and memory requirements" :)
You don't have to store them all; you can iterate over this generator:
def everyPossibleMatrix(x, y):
    N = x * y
    for i in range(2**N):
        b = "{:0{}b}".format(i, N)
        yield '\n'.join(b[j*x:(j+1)*x] for j in range(y))
Depending on what you want to accomplish with the generated matrices, you might be better off generating a random sample and running a number of simulations. Something like:
import numpy

matrix_samples = []
# generate 10 matrices
for i in range(10):
    sample = numpy.random.binomial(1, .5, 14*10)
    sample.shape = (14, 10)
    matrix_samples.append(sample)
You could do this a number of times to see how results vary across simulations. Of course, you could also modify the code to ensure that there are no repeats in a sample set, again depending on what you're trying to accomplish.
Are you saying that you have a table with 140 cells and each value can be 1 or 0 and you'd like to generate every possible output? If so, you would have 2^140 possible combinations...which is quite a large number.
Instead of just suggesting that this is infeasible, I would suggest considering a scheme that samples the important subset of all possible combinations instead of applying a brute-force approach. As one of the replies suggested, you are doing minimization. There are numerical techniques for this, such as simulated annealing and Monte Carlo sampling, as well as traditional minimization algorithms. You might want to look into whether one of them is appropriate in your case.
I was actually much more pessimistic to begin with, but consider:
from math import log, e
def timeInYears(totalOpsNeeded=2**140, currentOpsPerSecond=10**9, doublingPeriodInYears=1.5):
    secondsPerYear = 365.25 * 24 * 60 * 60
    doublingPeriodInSeconds = doublingPeriodInYears * secondsPerYear
    k = log(2, e) / doublingPeriodInSeconds  # time-proportionality constant
    timeInSeconds = log(1 + k*totalOpsNeeded/currentOpsPerSecond, e) / k
    return timeInSeconds / secondsPerYear
If we assume that computer processing power continues to double every 18 months, and you can currently do a billion combinations per second (optimistic, but for the sake of argument), and you start today, your calculation will be complete on or about April 29th, 2137.
Here is an efficient way to get started in Matlab:
First generate all 1024 possible rows of length 10 containing only zeros and ones:
dec2bin(0:2^10-1)
Now you have all possible rows, and you can sample from them as you wish. For example by calling the following line a few times:
randperm(1024,14)
