I have some time series data that's effectively been recorded at a truncated precision relative to the rate it changes. This leads to a stairstep-like quality when I graph it. I'm using Pandas to manipulate and store the data. Is there a way I can use Pandas to smooth out the stairsteps, inferring extra precision from the time series data?
In more detail, here's a sample graph:
The green line represents recorded temperatures. My temperature sensor is only accurate to a tenth of a degree Celsius, but the rate of temperature change is significantly less than a tenth of a degree every recording interval.
I think it should be possible to infer extra precision based on how quickly the values are changing, but I'm not sure what the best way to do that is. I got an okay-ish looking result by using pandas.rolling_mean, but that uses a fixed window for the mean, even though different parts of the graph would benefit from different window sizes. It also shortens narrower peaks because of the relatively-wide window.
Ideally I'd like to get something continuous enough that I can take a derivative of the data and not have tremendously spiky results.
So what, if anything, in Pandas can help me get the results I'm looking for?
Well, here's what I've done so far. I get okay results from this, though I still think there should be something that feels less like a hack.
If we assume that the temperature sensor is reasonably accurate to a hundredth of a degree but is only reporting tenths of a degree, we can infer that when a reading changes from, say 0.1 to 0.2 the actual reading at that time is around 0.15. So I scan through the series, find all of the places where the value changes, set the value to the mean of the two, and set all other values to NaN. Then I use Series.interpolate to construct pleasant-looking curves that are, for the most part, at least as accurate as the original readings.
Here's the code:
import numpy as np

def smooth_data(data, method='linear'):
    # Work on a float copy so we can insert NaNs.
    data = data.copy().astype(np.float64)
    # Walk consecutive pairs of index labels, leaving the endpoints untouched as anchors.
    for i0, i1 in zip(data.index[1:], data.index[2:]):
        if data[i0] == data[i1]:
            # Value didn't change: blank it out so interpolate() fills it in.
            data[i0] = np.nan
        else:
            # Value changed: the true reading is roughly midway between the two.
            data[i0] = (data[i0] + data[i1]) / 2
    return data.interpolate(method=method)
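For reference, here's a minimal way to exercise it on made-up stairstep data (not my real readings), assuming SciPy is available for the cubic interpolation:

import numpy as np
import pandas as pd

# Fake stairstep data: a slow sinusoidal drift quantized to tenths of a degree.
true_temp = 20 + 0.3 * np.sin(np.linspace(0, 2 * np.pi, 200))
raw = pd.Series(np.round(true_temp, 1))

smoothed = smooth_data(raw, method='cubic')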
Here's what the same data looks like with this smoothing (and cubic interpolation):
In a book on matplotlib I found a plot of 1/sin(x) that looks similar to this one that I made:
I used the domain
input = np.mgrid[0 : 1000 : 200j]
What confuses me greatly is that the sine function is periodic, so I don't understand why the maximal absolute value is decreasing. Plotting the same function in Wolfram Alpha does not show this decreasing effect. Using a different step amount
input = np.mgrid[0 : 1000 : 300j]
delivers a different result:
where we also have this decreasing tendency in maximal absolute value.
So my questions are:
How can I make a plot like this consistent i.e. independent of step-size/step-amount?
Why does one see this decreasing tendency even though the function is purely periodic?
The sampling interval (1000/199 ≈ 5.03) is comparable to the period of the sine function (2π ≈ 6.28), so what you're seeing is aliasing arising from the difference between the sampling frequency and a multiple of the true frequency. Since one of the roots is at 0, the discrepancy between each of the first few samples and the nearest multiple of π grows linearly as you move away from 0, producing a 1/x envelope.
In this example, input[5] is 5·(1000/(200−1)) ≈ 8π − 0.007113, so the function is about −141 there, as shown. input[10] is likewise 16π − 0.014226, so the function is about −70, and so on as long as the discrepancy stays much smaller than π.
It's possible for one of the quasi-periodic sample sequences to eventually land even closer to nπ, producing a more complicated pattern like that in the second plot.
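A quick numeric check of the values above (a minimal sketch using the same mgrid domain as the question):

import numpy as np

x = np.mgrid[0:1000:200j]          # same domain as in the question
y = 1 / np.sin(x)

# Discrepancy between each early sample and the nearest multiple of pi.
for n in (5, 10, 15):
    k = round(x[n] / np.pi)
    print(n, x[n] - k * np.pi, y[n])
# The discrepancy grows roughly linearly with n, so |y| falls off like 1/x.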
Why does one see this decreasing tendency even though the function is purely periodic?
Keep in mind that at every multiple of π the function actually goes to infinity, and the size of the spike displayed only reflects the largest sampled value at which the function still made sense. So you get a big spike whenever you happen to sample a point where the function is large, but not too large to be represented as a float.
To be able to plot anything, matplotlib throws away values that do not make sense, like the np.nan you get at multiples of π and the ±np.inf you get for values very close to them. I believe what happens is that one step away from zero you happen to get a value that is still very large but small enough not to be thrown away, while near π and its multiples the largest values get thrown away.
How can I make a plot like this consistent i.e. independent of step-size/step-amount?
You get strange behaviour around the values where your function becomes unreasonably large. Just set a y-limit to avoid plotting those extremely large values.
import numpy as np
import matplotlib.pyplot as plt

# Start just above 0 to avoid dividing by sin(0) = 0.
x = np.linspace(10**-10, 50, 10**4)
plt.plot(x, 1 / np.sin(x))
plt.ylim(-15, 15)   # clip the huge spikes near multiples of pi
plt.show()
My boss wants metrics on our ticket processing system, and one of the metrics he wants is "the 90% time", which he defines as the time it takes 90% of the tickets to be processed. I guess he's assuming that the other 10% are anomalous and can be ignored. I would like this to at least approach some statistical validity. So I've got a list of the times that I throw into a numpy array. This is the code I've come up with.
import numpy as np
inliers = data[data < np.percentile(data, 90)]   # values strictly below the 90th percentile
ninety_time = inliers.max()                       # largest remaining value
Is this valid? Is there a better way?
Percentiles are a perfectly valid statistical approach. They are used to provide robust descriptions of data. For example, the 50th percentile is the median, and box plots typically show the 25th, 50th, and 75th percentiles to give an idea of the range covered by the data.
The 90th percentile can be seen as a rather naive and rough estimate of the maximum value that is less vulnerable to outliers than the actual max value. (Obviously it is somewhat biased: it will always be less than the true maximum.) Use this interpretation with care. It's safest to see the 90th percentile as what it is: a value with 90% of the data below it and 10% above.
Your code is somewhat redundant, because np.percentile(data, 90) already returns the value below which 90% of the elements in data lie. I would say that value is exactly the 90% time, so there is no need to take the maximum of the elements below it. For a large number of samples and continuous values, the difference between the two vanishes anyway.
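In other words, a minimal sketch of the direct computation (the data array here is just a stand-in for your real processing times):

import numpy as np

data = np.random.exponential(30, size=1000)   # stand-in for the real ticket processing times
ninety_time = np.percentile(data, 90)         # the time that 90% of tickets come in under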
I am studying physics and ran into a really interesting problem. I'm not an expert in programming, so please take this into account while reading.
I really hope that someone can help me with this problem, because I have been struggling with it for about two months now without any success.
So here is my Problem:
I have a bunch of data sets (more than 2, fewer than 20) from numerical calculations. Each set gives measurement values against a position x. I have a set of sensors and want to find the best positions x for my sensors such that the integral of the interpolation through the sensor readings comes as close as possible to the integral of the numerical data set.
As this sounds like a typical mathematical problem I started to look for some theorems but I did not find anything.
So I started to write a python program based on the SLSQP minimizer. I chose this because it can handle bounds and constraints. (Note there is always a sensor at 0 and one at 1)
Constraints: the sensor array must stay sorted all the time such that x_i smaller than x_i+1 and the interval of x is normalized to [0,1].
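To make the setup concrete, here is a stripped-down sketch of how I pass the bounds and ordering constraints to SLSQP (the objective here is only a stand-in, not my real integral-error function):

import numpy as np
from scipy.optimize import minimize

def objective(x):
    # Stand-in objective; in my real code this is the difference between the
    # integral of the interpolation through the sensor readings and the
    # integral of the full numerical data set.
    return np.sum(np.diff(np.concatenate(([0.0], x, [1.0]))) ** 2)

n_free = 8                                    # sensors between the fixed ones at 0 and 1
x0 = np.linspace(0, 1, n_free + 2)[1:-1]      # homogeneous starting guess

# x_i < x_{i+1}: SLSQP inequality constraints must evaluate to >= 0
constraints = [{'type': 'ineq', 'fun': lambda x, i=i: x[i + 1] - x[i]}
               for i in range(n_free - 1)]
bounds = [(0.0, 1.0)] * n_free

result = minimize(objective, x0, method='SLSQP',
                  bounds=bounds, constraints=constraints)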
Before doing an overall optimization I started to look for good starting points by searching for maxima, minima and linear regions of my given data sets. But an optimization over 40 values turned out to deliver bad results.
In my second try I searched for these points and defined certain areas around them. I then optimized each area with 1 to 40 sensors, compared the results, and decided which areas were worth putting more sensors in. In the last step I wanted to do an overall optimization again. But this idea didn't turn out to be the proper solution either, because the optimization had convergence problems as well.
The big problem was that my optimizer broke the boundaries. I handled this by interrupting the optimization, because once the boundaries were broken the final result was no longer correct. When that happens I reset my initial setup to a homogeneous distribution. After that there are normally no boundary violations, but the result tends to be a homogeneous distribution as well, which is often obviously not the optimal distribution.
As my algorithm works for simple examples and dies for more complex data, I think there is a general problem and not just an error in my coding. Does anyone have an idea how to move on, or know some theoretical results about this kind of problem?
The attached plot shows the areas in different colors. The function is shown at the bottom and the sensor positions are represented as dots. Dots at y = 1 are from the optimization with one sensor, y = 2 represents the result of the optimization with 2 sensors, and so on. As the program reaches higher sensor numbers the whole thing gets more and more homogeneous.
It is easy to see that if n is the number of sensors and n goes to infinity, you end up with a totally homogeneous distribution. But as far as I can see, this should not happen for just 10 sensors.
Looking at this answer:
Python Scipy FFT wav files
The technical part is obvious and working, but I have a few theoretical questions (the code in question is below):
1) Why do I have to normalize (b = ...) the frames? What would happen if I used the raw data?
2) Why should I only use half of the FFT result (d=...)?
3) Why should I abs(c) the FFT result?
Perhaps I'm missing something due to inadequate understanding of WAV format or FFT, but while this code works just fine, I'd be glad to understand why it works and how to make the best use of it.
Edit: in response to the comment by @Trilarion:
I'm trying to write a simple, not 100% accurate but more of a proof-of-concept speaker diarisation tool in Python. That means taking a wav file (right now I am using this one for my tests) and saying, for each second (or any other resolution), whether the speaker is person #1 or person #2. I know in advance that there are 2 speakers and I am not trying to match them to any known voice signatures, just to separate them. Right now I take each second, FFT it (and thus get a list of frequencies), and cluster the results using KMeans with the number of clusters between 2 and 4 (A, B [, Silence [, A+B]]).
I'm still new to analyzing wav files and audio in general.
import matplotlib.pyplot as plt
import scipy.fftpack as sfft
from scipy.io import wavfile                  # get the api

fs, data = wavfile.read('test.wav')           # load the data
a = data.T[0]                                 # this is a two channel soundtrack, I get the first track
b = [(ele / 2 ** 8.) * 2 - 1 for ele in a]    # this is an 8-bit track, b is now normalized on [-1, 1)
c = sfft.fft(b)                               # calculate the FFT (a list of complex numbers)
d = len(c) // 2                               # you only need half of the fft list
plt.plot(abs(c[:(d - 1)]), 'r')
plt.show()
To address these in order:
1) You don't need to normalize, but without normalization the numbers reflect the raw structure of the digitized waveform and are unintuitive. For example, how loud is a value of 67? It's easier to interpret the values if they are normalized to the range -1 to 1. (But if you wanted to implement a filter, for example, where you do an FFT, modify the FFT values, and then do an IFFT, normalizing would be an unnecessary hassle.)
2) and 3) are similar in that they both have to do with the math living primarily in complex-number space. That is, an FFT takes a sequence of complex numbers (e.g., [.5+.1j, .4+.7j, .4+.6j, ...]) to another sequence of complex numbers.
So in detail:
2) It turns out that if the input waveform is real rather than complex, the FFT has a symmetry about frequency 0, so only the values at frequencies >= 0 carry unique information.
3) The values output by the FFT are complex, so they have a Re and an Im part, but this can also be expressed as a magnitude and a phase. For audio signals it's usually the magnitude that is most interesting, because that is primarily what we hear. Therefore people often use abs() (which gives the magnitude), but the phase can be important for other problems as well.
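Here's a small sketch of both points using NumPy on a synthetic tone (the names and sample rate are illustrative, not taken from your code):

import numpy as np

fs = 8000                                   # assumed sample rate
t = np.arange(fs) / fs
sig = np.sin(2 * np.pi * 440 * t)           # a real-valued 440 Hz tone

spectrum = np.fft.fft(sig)
half = len(spectrum) // 2                   # point 2: keep only frequencies >= 0
magnitude = np.abs(spectrum[:half])         # point 3: magnitude is what we "hear"

# The negative-frequency half is the complex conjugate mirror of the positive half,
# so it adds no new information for a real input:
print(np.allclose(spectrum[1:half], np.conj(spectrum[-1:half:-1])))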
That depends on what you're trying to do. It looks like you only want to plot the spectral density, and for that purpose it's fine.
In general, each DFT coefficient depends on the phase of the corresponding frequency component, so if you want to keep the phase information you have to keep the argument (angle) of the complex numbers, not just their magnitude.
The symmetry you see is only guaranteed if the input is a real-valued sequence (IIRC). It's related to the mirroring distortion you get with frequencies above the Nyquist frequency (half the sampling frequency): the original frequency shows up in the DFT, but so does the mirrored frequency.
If you're going to apply an inverse DFT later, you should keep the full data and also keep the arguments (phases) of the DFT coefficients.
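As a tiny illustration of that last point (purely synthetic data):

import numpy as np

sig = np.random.randn(1024)
spec = np.fft.fft(sig)
roundtrip = np.fft.ifft(spec).real           # full complex spectrum reconstructs the signal
lossy = np.fft.ifft(np.abs(spec)).real       # magnitude only: phase lost, signal not recovered
print(np.allclose(roundtrip, sig), np.allclose(lossy, sig))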
I am working on discriminating some signals for the calculation of the free-period oscillation and damping ratio of a spring-mass system (a seismometer). I am using Python as the main processing program. What I need to do is import this signal, parse it to find the falling edge, then return the peaks from the tops of each oscillation as a list so that I can calculate the damping ratio. The calculation of the free period is fairly straightforward once I've determined the location of the oscillation within the dataset.
Where I am mostly hung up is how to parse through the list, identify the falling edge, and then capture each of the Z0..Zn elements. The oscillation frequency can be calculated fairly easily using an FFT once I know where that falling edge is, but if I process the whole file, a lot of energy from the energizing of the system before the release can sometimes force the algorithm to kick out an ultra-low frequency that represents the near-DC offset rather than the actual oscillation. (It's especially a problem at higher damping ratios, where there might be only four or five measurable rebounds.)
Has anyone got some ideas on how I can go about this? Right now, the code below uses arbitrarily assigned values for the signal in the screenshot; however, I need the code to calculate those values. Also, I haven't yet determined how to create my list of peaks for the calculation of the damping ratio h. Your help in getting some ideas together for solving this would be very welcome. Because I have such a small Stack Overflow reputation, I have included my signal in a sample screenshot at the following link:
(Boy, I hope this works!)
Tycho's sample signal -->
https://github.com/tychoaussie/Sigcal_v1/blob/066faca7c3691af3f894310ffcf3bbb72d730601/Freeperiod_Damping%20Ratio.jpg
##########################################################
import os, sys, csv
from scipy import signal
from scipy.integrate import simps
import pylab as plt
import numpy as np
import scipy as sp
#
# code goes here that imports the data from the csv file into lists.
# One of those lists is called laser, a time-history list in counts from an analog-digital-converter
# sample rate is 130.28 samples / second.
#
#
# Find the period of the observed signal
#
delta = 0.00767 # Calculated elsewhere in code; seconds per sample (i.e. 130.28 samples/sec)
# laser is a list with about 20,000 elements representing time-history data from a laser position sensor
# The release of the mass and system response occurs starting at sample number 2400 in this particular instance.
sense = signal.detrend(laser[2400:(2400+8192)]) # Remove the linear trend from the signal segment
N = len(sense)
W = np.fft.fft(sense)
freq = np.fft.fftfreq(len(sense),delta) # First argument is the number of samples, delta is the sample spacing in seconds
#
# Take the sample with the largest amplitude as our center frequency.
# This only works if the signal is heavily sinusoidal and stationary
# in nature, like our calibration data.
#
idx = np.where(abs(W)==max(np.abs(W)))[0][-1]
Frequency = abs(freq[idx]) # Frequency in Hz
period = 1/(Frequency*delta) # represents the number of samples for one cycle of the test signal.
#
# create an axis representing time.
#
dt = [] # Create an x axis that represents elapsed time in seconds. delta = seconds per sample, i represents sample count
for i in range(0,len(sense)):
dt.append(i*delta)
#
# At this point, we know the frequency interval, the delta, and we have the arrays
# for signal and laser. We can now discriminate out the peaks of each rebound and use them to process the damping ratio of either the 'undamped' system or the damping ratio of the 'electrically damped' system.
#
print('Frequency calculated to', Frequency, 'Hz.')
Here's a somewhat unconventional idea, which I think might be fairly robust and doesn't require a lot of heuristics and guesswork. The data you have is really high quality and fits a known curve, which helps a lot here. Here I assume the "good part" of your curve has the form:
V = a * exp(-γ * t) * cos(2 * π * f * t + φ) + V0 # [Eq1]
V: voltage
t: time
γ: damping constant
f: frequency
a: starting amplitude
φ: starting phase
V0: DC offset
Outline of the algorithm
Getting rid of the offset
Firstly, calculate the derivative numerically. Since the data quality is quite high, the noise shouldn't affect things too much.
The voltage derivative V_deriv has the same form as the original data: same frequency and damping constant, but with a different phase ψ and amplitude b,
V_deriv = b * exp(-γ * t) * cos(2 * π * f * t + ψ) # [Eq2]
The nice thing is that this will automatically get rid of your DC offset.
Alternative: This step isn't completely needed – the offset is a relatively minor complication to the curve fitting, since you can always provide a good guess for the offset by averaging the entire curve. The trade-off is that you get a better signal-to-noise ratio if you don't use the derivative.
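A minimal sketch of the derivative step with NumPy (the arrays here are synthetic placeholders, not your data):

import numpy as np

# t: sample times, V: recorded voltages (synthetic stand-ins with a DC offset)
t = np.linspace(0, 10, 2000)
V = np.exp(-0.3 * t) * np.cos(2 * np.pi * 2.0 * t) + 0.5

# Centered-difference derivative; differentiation drops the constant offset V0.
V_deriv = np.gradient(V, t)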
Preliminary guesswork
Now consider the curve of the derivative (or the original curve if you skipped the last step). Start with the data point on the very far right, and then follow the curve leftward as many oscillations as you can until you reach an oscillation whose amplitude is greater than some amplitude threshold A. You want to find a section of the curve containing one oscillation with a good signal-to-noise ratio.
How you determine the amplitude threshold is difficult to say. It depends on how good your sensors are. I suggest leaving this as a parameter for tweaking later.
Curve-fitting
Now that you have captured one oscillation, you can do lots of easy things: estimate the frequency f and damping constant γ. Using these initial estimates, you can perform a nonlinear curve fit of all the data to the right of this oscillation (including the oscillation itself).
The function that you fit is [Eq2] if you're using the derivative, or [Eq1] if you use the original curve (but they are the same family of functions anyway, so it's a rather moot point).
Why do you need these initial estimates? For a nonlinear curve fit to a decaying wave, it's critical that you give a good initial guess for the parameters, especially the frequency and damping constant. The other parameters are comparatively less important (at least in my experience), but here's how you can get them just in case (a rough sketch of the fit follows the list below):
The phase is 0 if you start from a maximum, or π if you start at a minimum.
You can guess the starting amplitude too, though the fitting algorithm will typically do fine even if you just set it to 1.
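A rough sketch of what the fit itself might look like with SciPy's curve_fit, using synthetic stand-in data (the real code would fit the measured derivative to the right of the chosen oscillation, with the initial guesses taken from it):

import numpy as np
from scipy.optimize import curve_fit

def damped_cosine(t, b, gamma, f, psi):
    # [Eq2]: damped oscillation without a DC offset
    return b * np.exp(-gamma * t) * np.cos(2 * np.pi * f * t + psi)

# Synthetic stand-in for the "good" part of the data.
t = np.linspace(0, 10, 2000)
data = damped_cosine(t, 1.0, 0.3, 2.0, 0.5) + 0.01 * np.random.randn(len(t))

# The frequency and damping guesses would come from the single oscillation examined above.
p0 = [1.0, 0.2, 1.8, 0.0]
params, cov = curve_fit(damped_cosine, t, data, p0=p0)
perr = np.sqrt(np.diag(cov))   # rough uncertainty estimates for each parameter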
Finding the edge
The curve fit should be really good at this point – you can tell because the discrepancy between your curve and the fit is very low. Now the trick is to attempt to increase the domain of the curve to the left.
If you stay within the sinusoidal region, the curve fit should remain quite good (how you judge the "goodness" requires some experimenting). As soon as you hit the flat region of the curve, however, the errors will start to increase dramatically and the parameters will start to deviate. This can be used to determine where the "good" data ends.
You don't have to do this point by point though – that's quite inefficient. A binary search should work quite well here (possibly the "one-sided" variation of it).
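For what it's worth, a sketch of how that one-sided/binary search might be organized (fit_rms here is a hypothetical callable that refits [Eq2] on the data from a given start index onward and returns the RMS residual of that fit):

def find_edge(fit_rms, right_start, tol=3.0):
    # fit_rms(i): hypothetical helper that refits [Eq2] on data[i:] and returns the RMS residual.
    # right_start: an index already known to lie in the clean sinusoidal region.
    baseline = fit_rms(right_start)
    lo, hi = 0, right_start
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fit_rms(mid) > tol * baseline:
            lo = mid    # residual blew up: mid still includes the flat region, move right
        else:
            hi = mid    # fit is still good: the edge is at or to the left of mid
    return hi           # first index from which the decaying-cosine fit holds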
Summary
You don't have to follow this exact procedure, but the basic gist is that you can perform some analysis on a small part of the data starting on the very right, and then gradually increase the time domain until you reach a point where you find that the errors are getting larger rather than smaller as they ought to be.
Furthermore, you can also combine different heuristics to see if they agree with each other. If they don't, then it's probably a rather sketchy data point that requires some manual intervention.
Note that one particular advantage of the algorithm I sketched above is that you will get the results you want (damping constant and frequency) as part of the process, along with uncertainty estimates.
I neglected to mention most of the gory mathematical and algorithmic details so as to provide a general overview, but if needed I can provide more detail.