Quite often I have to work with a bunch of noisy, somewhat correlated time series. Sometimes I need some mock data to test my code, or to provide some sample data for a question on Stack Overflow. I usually end up either loading some similar dataset from a different project, or just adding a few sine functions and noise and spending some time to tweak it.
What's your approach? How do you generate noisy signals with certain specs? Have I just overlooked some blatantly obvious standard package that does exactly this?
The features I would generally like to get in my mock data:
Varying noise levels over time
Some history in the signal (like a random walk?)
Periodicity in the signal
Being able to produce another time series with similar (but not exactly the same) features
Maybe a bunch of weird dips/peaks/plateaus
Being able to reproduce it (some seed and a few parameters?)
I would like to get a time series similar to the two below [A]:
I usually end up creating a time series with a bit of code like this:
import numpy as np

n = 1000
limit_low = 0
limit_high = 0.48

# base noise + amplitude-modulated noise + two squared sines for some periodicity
my_data = (
    np.random.normal(0, 0.5, n)
    + np.abs(np.random.normal(0, 2, n)
             * np.sin(np.linspace(0, 3*np.pi, n)))
    + np.sin(np.linspace(0, 5*np.pi, n))**2
    + np.sin(np.linspace(1, 6*np.pi, n))**2
)

# rescale everything into the [limit_low, limit_high] range
scaling = (limit_high - limit_low) / (my_data.max() - my_data.min())
my_data = my_data * scaling
my_data = my_data + (limit_low - my_data.min())
Which results in a time series like this:
Which is something I can work with, but still not quite what I want. The problem here is mainly that:
it doesn't have the history/random walk aspect
it's quite a bit of code and tweaking (this is especially a problem if I want to share a sample time series)
I need to re-tweak the values (frequencies of the sines, etc.) to produce another similar, but not exactly the same, time series. (A rough, parameterized sketch that tries to address these points is below.)
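Something along these lines might cover the wish list; this is only a sketch, not a standard package, and every parameter name below is a made-up placeholder. It wraps a random walk (the "history"), a couple of squared sines (periodicity), and noise whose level itself varies over time in one seeded function, so a similar-but-different series is just another seed:

import numpy as np

def mock_series(n=1000, seed=0, walk_scale=0.02, periods=(3, 5),
                noise_base=0.1, noise_mod=0.3, limits=(0.0, 0.48)):
    # random walk ("history") + squared sines (periodicity) + time-varying noise,
    # rescaled into the requested limits; all parameters are arbitrary choices
    rng = np.random.default_rng(seed)
    t = np.linspace(0, 2*np.pi, n)
    walk = np.cumsum(rng.normal(0, walk_scale, n))
    periodic = sum(np.sin(k*t + rng.uniform(0, 2*np.pi))**2 for k in periods)
    noise_level = noise_base + noise_mod * np.abs(np.sin(1.5*t))
    y = walk + periodic + rng.normal(0, 1, n) * noise_level
    lo, hi = limits
    return lo + (y - y.min()) * (hi - lo) / (y.max() - y.min())

series_a = mock_series(seed=1)
series_b = mock_series(seed=2)   # similar features, different realisation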
[A]: For those wondering, the time series depicted in the first two images is the traffic intensity at two points along one road over three days (midnight to 6 am is clipped), in cars per second (moving Hanning-window average over 2 min), resampled to 1000 points.
Have you looked into TSimulus? By using Generators, you should be able to generate data with specific patterns, periodicity, and cycles.
The TSimulus project provides tools for specifying the shape of a time series (general patterns, cycles, importance of the added noise, etc.) and for converting this specification into time series values.
Otherwise, you can try "drawing" the data yourself and exporting those data points using Time Series Maker.
Related
I am writing a Python script for some geometrical data manipulation (calculating motion trajectories for a multi-drive industrial machine). Generally, the idea is that there is a given shape (let's say an ellipse, but in the general case it can be any convex shape, defined by a series of 2D points), which is rotated, and its uppermost tangent point must be followed. I don't have a problem with the latter part, but I need a little hint with the 2D shape preparation.
Let's say that the ellipse was defined with too few points, for example 25. (As I said, ultimately this can be any shape, for example a rounded hexagon.) To maintain the necessary precision I need far more points (let's say 1000), preferably equally distributed over the whole shape, or with a higher density of points near corners, sharp curves, etc.
I have a few things ringing in my head. I guess that the DFT (FFT) would be a good starting point for this resampling; while analyzing scipy.signal.resample() I found that there are far more functions in the scipy.signal package which sound promising to me...
What I'm asking for is a suggestion of which way to go and which tool might be the most suitable for this job. Maybe there is a tool meant exactly for what I'm looking for, or maybe I'm overthinking this and one of the FFT-based implementations like resample() will work just fine (of course, after some adjustments at the start and end of the shape to make sure it closes without issues)?
scipy.signal sounds promising; however, as far as I understand, it is meant to work with time series data, not geometrical data. I guess this may cause some problems, as my data isn't a function (in the mathematical sense).
Thanks and best regards!
As far as I understood, what you want is to get an interpolated version of your original data.
The DFT (or FFT) will not achieve this purpose, since it performs a Fourier transform (which is not what you want).
Theoretically speaking, what you need in order to interpolate your data is a function that calculates the result at the new data points.
So, let's say your data contains 5 points, each of which holds a 1D (to simplify) number representing your data, and you want a new, denser array filled with the linear interpolation of your original data.
Using numpy.interp:
import numpy as np
original_data = [2, 0, 3, 5, 1] # define your data in 1D
new_data_resolution = 0.5 # define new sampling distance (i.e, your x-axis resolution)
interp_data = np.interp(
x = np.arange(0, 5-1+new_data_resolution , new_data_resolution), # new sampling points (new axis)
xp = np.arange(len(original_data)),  # original sampling points
fp = original_data
)
# now interp_data contains (5-1) / 0.5 + 1 = 9 points
After this, you will have an array of (5 - 1) / new_data_resolution + 1 = 9 points, whose values are (in this case) a linear interpolation of your original data.
After you have understood this example, you can dive into the scipy.interpolate module to get a better understanding of the interpolation functions (my example uses a linear function to fill in the data at the missing points).
Applying this to n-dimensional arrays is straightforward: iterate over each dimension of your data.
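For the original closed-shape case, one possible direction (only a sketch, assuming scipy.interpolate's parametric splines are acceptable for your shapes) is a periodic spline fit with splprep/splev, which resamples the outline at as many points as you like:

import numpy as np
from scipy.interpolate import splprep, splev

# coarse 25-point ellipse standing in for "any convex shape"
theta = np.linspace(0, 2*np.pi, 25, endpoint=False)
x, y = 3*np.cos(theta), 1*np.sin(theta)

# per=True treats the curve as closed, s=0 forces it through every input point
tck, u = splprep([x, y], s=0, per=True)
x_fine, y_fine = splev(np.linspace(0, 1, 1000), tck)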
My objective is to detect all kinds of seasonalities and their time periods that are present in a timeseries waveform.
I'm currently using the following dataset:
https://www.kaggle.com/rakannimer/air-passengers
At the moment, I've tried the following approaches:
1) Use of FFT:
import pandas as pd
import numpy as np
import scipy.fft
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# https://www.kaggle.com/rakannimer/air-passengers
df = pd.read_csv('AirPassengers.csv')
df.head()

frequency_eval_max = 100
A_signal_rfft = scipy.fft.rfft(df['#Passengers'], n=frequency_eval_max)
n = np.shape(A_signal_rfft)[0]  # np.size(t)
frequencies_rel = len(A_signal_rfft) / frequency_eval_max * np.linspace(0, 1, int(n))

fig = plt.figure(3, figsize=(15, 6))
plt.clf()
plt.plot(frequencies_rel, np.abs(A_signal_rfft), lw=1.0, c='paleturquoise')
plt.stem(frequencies_rel, np.abs(A_signal_rfft))
plt.xlabel("frequency")
plt.ylabel("amplitude")
This results in the following plot:
But it doesn't result in anything conclusive or comprehensible.
Ideally I wish to see the peaks representing daily, weekly, monthly and yearly seasonality.
Could anyone point out what I am doing wrong?
2) Autocorrelation:
from pandas.plotting import autocorrelation_plot
plt.rcParams.update({'figure.figsize':(10,6), 'figure.dpi':120})
autocorrelation_plot(df['#Passengers'].tolist())
After doing which I get a plot like the following:
But how do I read this plot and how can I derive the presence of the various seasonalities and their periods from this?
3) STL Decomposition
df.set_index('Month',inplace=True)
df.index=pd.to_datetime(df.index)
#drop null values
df.dropna(inplace=True)
df.plot()
result = seasonal_decompose(df['#Passengers'], model='multiplicative', period=12)
result.seasonal.plot()
This gives the following plot:
But here I can only see one kind of seasonality.
So how do we detect all the types of seasonalities and their time periods that are present using this method?
Hence, I've tried 3 different approaches but they seem either erroneous or incomplete.
Could anyone please help me out with the most effective approach (even apart from the ones I've tried) to detect all kinds of seasonalities and their time periods for any given timeseries data?
I still think a Fourier analysis is the way to go; it's just that the 0-frequency result is overshadowing any insight.
That component is driven by the average of your data set (in a power spectrum it is proportional to the square of the mean), and all your records are positive, far from the typical zero-mean sinusoid you would analyze with a Fourier transform. So simply subtract the average of your dataset from your dataset before doing the FFT and see how it looks. This would also help with the autocorrelation technique.
Also, you MUST give units to your frequency values. Do not settle for the raw values from the FFT; those are related to the sampling frequency and the span of your dataset. Reason about it and adequately label the daily, weekly, monthly and annual frequencies in your chart.
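A minimal sketch of that advice, assuming the same AirPassengers.csv file and monthly samples (so the sampling interval d is 1 month and frequencies come out in cycles per month, with the yearly seasonality at 1/12):

import pandas as pd
import numpy as np
import scipy.fft
import matplotlib.pyplot as plt

df = pd.read_csv('AirPassengers.csv')
x = df['#Passengers'].to_numpy(dtype=float)
x = x - x.mean()                           # remove the 0-frequency (DC) component

spec = np.abs(scipy.fft.rfft(x))
freqs = scipy.fft.rfftfreq(len(x), d=1.0)  # cycles per month

plt.stem(freqs, spec)
plt.axvline(1/12, color='r', ls='--', label='1 cycle / 12 months')
plt.xlabel('frequency [cycles per month]')
plt.ylabel('amplitude')
plt.legend()
plt.show()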
Using the FFT, you can get the fundamental frequency. You can then use a low-pass filter or just manually select the first n frequencies; these frequencies will correspond to the 'seasonalities'. Transform your filtered FFT back into the time domain and you can visualize the most basic underlying repetitions. You can easily calculate the time period of those repetitions and visualize them by individually plotting F0, F1, ... in the time domain.
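For instance, a rough sketch of keeping only the first few rfft coefficients and transforming back (n_keep is an arbitrary choice, and the same AirPassengers.csv file is assumed):

import pandas as pd
import numpy as np
import scipy.fft

df = pd.read_csv('AirPassengers.csv')
x = df['#Passengers'].to_numpy(dtype=float)
x = x - x.mean()

spec = scipy.fft.rfft(x)
n_keep = 3                              # number of leading frequencies to keep
filtered = np.zeros_like(spec)
filtered[:n_keep + 1] = spec[:n_keep + 1]

x_filtered = scipy.fft.irfft(filtered, n=len(x))
# x_filtered now shows the low-frequency "seasonal" structure;
# component k repeats with a period of len(x) / k samples (months here)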
I have a DataFrame from which I’m trying to build a multiple linear regression model. The problem I have is that one of my Y variables is heavily skewed within the data set, so it’s weighting one side far too heavily. I need a way to normalize that one column, and the only way I can think to do that is to select and delete rows until I have an evenly distributed data set. I’ve built a simple example of what I’m talking about below. I would want column [0] to end up normally distributed by getting rid of the low tail. What’s the best way to go about doing this?
import pandas as pd
from matplotlib import pyplot as plt
from numpy.random import seed
from numpy.random import randn
from numpy.random import rand
from numpy import append

seed(1)
# column 0: normal data plus a heavy upper tail
data = 5 * randn(100) + 10
tail = 10 + (rand(50) * 100)
data = append(data, tail)
# column 1: plain normal data
data2 = 5 * randn(150) + 10
s1 = pd.Series(data)
s2 = pd.Series(data2)
df = pd.concat([s1, s2], axis=1)
First you need to figure out a threshold value to discriminate which values belong to the tail (i.e. are too high) and which do not.
A very empirical way to do it is by visual inspection: plot a histogram of your data and see where the tail starts.
plt.hist(df[0])
plt.show()
Using the sample data you provided, you can see that the tail starts at about 20, so you can treat every value greater than 20 as belonging to the tail of the distribution.
Of course, this is a very rough way. Depending on your real data, you may have a better way to define your threshold, maybe based on the theoretical model behind the data. I mean, I guess you should know or at least have an idea about why there is a tail in your distribution.
In any case, whatever criterion you use to define the threshold value (this is really up to you), once you have it you can simply set all the values greater than the threshold to NaN:
import numpy as np
threshold = 20  # e.g. from the histogram above
df.loc[df[0] > threshold, 0] = np.nan
Disclaimer:
This approach may be considered inappropriate or wrong, because you are tampering with the data. I don't know what your final goal is, but be careful.
You can try to use RANSAC for this. Use skewness as the objective function and try to minimize it. This should give you the samples that belong to an unskewed distribution. (Example, Example with different model, Example)
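A very loose sketch of that idea (this is not the textbook RANSAC inlier/consensus loop, just repeated subsampling scored by skewness; keep_frac and n_trials are arbitrary placeholders):

import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

def least_skewed_subset(values, keep_frac=0.7, n_trials=2000):
    # repeatedly sample a subset and keep the indices of the subset
    # whose skewness is closest to zero
    values = np.asarray(values)
    n_keep = int(len(values) * keep_frac)
    best_idx, best_score = None, np.inf
    for _ in range(n_trials):
        idx = rng.choice(len(values), size=n_keep, replace=False)
        score = abs(skew(values[idx]))
        if score < best_score:
            best_idx, best_score = idx, score
    return best_idx

# keep_idx = least_skewed_subset(df[0].dropna().to_numpy())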
I've got a bunch of points [ID, lat, lon, time], but the time is unreliable. The time for a couple of points is often mixed up, and there are some large gaps. I want to be able to calculate a track (basically just a linear fit or polyfit) from the points, but I'm struggling to get them into some kind of order.
First I tried ordering by lat/lon, and this works for the cases where the track is moving constantly in one direction. There are all kinds of mismatches and problems when the track turns back on itself.
Maybe it's a travelling salesman problem but in this case I don't know where the object's track starts/ends.
I've thought about picking a point at random, travelling to the next closest point, and repeating; but how would I complete the track if my random point is in the middle, and there are often large gaps between points?
GPS points, incorrectly placed into tracks
Here's a picture of some of the GPS points, colour coded by ID. I've sorted the points by [lat,lon] and you can see the blue track has problems.
This is so simple to do manually, just join the dots, but I can't figure it out computationally. I'm using python/numpy/pandas for this, and there are millions of these points, so it would be helpful to avoid computationally intensive methods, but at this point I'm just plain stuck.
EDIT:
Okay, so this is not so simple. It's probably going to involve writing particle/Kalman filters or maybe some kind of Hamiltonian cost equation and then iterating the whole damn track to get an optimal solution. The best (least work for me) solution would be to try to correct the junk time field and possibly build a statistical guesser from the average bearing of point segments.
EDIT + Solution:
Okay, so it's not that complex. In the data I'm looking at, the objects generally travel N-S or E-W with little deviation. Where there is complex manoeuvring I usually have more reliable time data. The non-general solution for my dataset is to check whether the track can be defined as a function of latitude (N-S travel with no S-N movement component), and otherwise whether it can be a function of longitude. Then order by lat/lon and bam. This won't work in the case of spirals or other complex tracks, but those are minimal in my data.
Not the perfect solution but good enough for me.
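A minimal sketch of that check (the column names 'lat', 'lon' and 'ID' are assumptions, and comparing coordinate spans is only a crude stand-in for the "function of latitude vs. longitude" test):

import pandas as pd

def order_track(points):
    # order one track's points along whichever coordinate it mostly
    # progresses along: latitude for N-S travel, longitude for E-W;
    # this breaks down for spirals and tracks that double back
    lat_span = points['lat'].max() - points['lat'].min()
    lon_span = points['lon'].max() - points['lon'].min()
    key = 'lat' if lat_span >= lon_span else 'lon'
    return points.sort_values(key).reset_index(drop=True)

# ordered = df.groupby('ID', group_keys=False).apply(order_track)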
Hm, it seems like simple clustering, or even just sorting with the right metric, would do the trick.
Your data
from IPython.display import Image
Image('http://i.stack.imgur.com/76pNx.png')
Generate similar data
import numpy as np
np.random.seed(42)
data_lat = np.arange(300, dtype=np.int32) * (1 + (np.random.random(300) - 0.5) * 0.1)
data_lon = np.arange(5, 305, dtype=np.int32) * (1 + (np.random.random(300) - 0.5) * 0.1)
import matplotlib.pyplot as plt
import seaborn as sns
plt.scatter(data_lat, data_lon)
import itertools
seq_data = [(la, lo) for i, (la, lo) in enumerate(zip(data_lat, data_lon)) if
i in itertools.chain(range(20), range(55, 70), range(120, 165),
range(200, 250), range(280, 300))]
plt.plot(*zip(*seq_data))
plt.scatter(*zip(*seq_data))
Scramble
import random
random.seed(42)
data = seq_data.copy()
random.shuffle(data)
plt.plot(*zip(*data))
plt.scatter(*zip(*data))
Sort
# sort by distance from the origin
data.sort(key=lambda t: (t[0]**2 + t[1]**2) ** 0.5)
plt.plot(*zip(*data))
plt.scatter(*zip(*data))
I have a big data set representing 1.2M points in a 220-dimensional periodic space (each coordinate varies over (-pi, pi))... (matrix: 1.2M x 220).
I would like to calculate a histogram of distances between these points, taking the periodicity into account. I have written some code in Python, but it still runs quite slowly for my test case (I am not even trying to run it on the whole set...).
Can you maybe take a look and help me with some tweaking?
Any suggestions and comments much appreciated.
import numpy as np
import time

# 1000x220 test set, coordinates in (-pi, pi)
d = np.random.random((1000, 220)) * 2 * np.pi - np.pi

# theoretical limit on the histogram range: the maximum distance between
# two points is pi in each dimension
m = np.zeros(np.shape(d)[1]) + np.pi
m_ = np.sqrt(np.sum(m**2))
# hist range is from 0 to mm
mm = np.floor(m_)
bins = int(mm / 0.01)
hist = np.zeros(bins)

# proper calculations
start_time = time.time()
for i in range(np.shape(d)[0]):
    diff = d[:-(i+1), :] - d[i+1:, :]
    diff = np.absolute(diff)
    adiff = diff - np.pi
    diff = np.pi - np.absolute(adiff)   # periodic (wrap-around) distance per dimension
    s = np.sqrt(np.einsum('ij,ij->i', diff, diff))
    hist += np.histogram(s, range=(0, mm), bins=bins)[0]
print(time.time() - start_time)
I think you will see the most improvement from breaking the main loop into smaller parts: divide range(...) into a couple of smaller ranges and use the threading module to have a couple of threads run the loop chunks concurrently.
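A rough sketch of that suggestion (NumPy releases the GIL inside the heavy array operations, so threads can overlap some of the work; the chunking and thread count are arbitrary, and d, mm, bins are the variables from the question's code):

import numpy as np
from concurrent.futures import ThreadPoolExecutor

def partial_hist(rows, d, mm, bins):
    # accumulate the distance histogram for one chunk of outer-loop indices
    h = np.zeros(bins)
    for i in rows:
        diff = np.abs(d[:-(i + 1), :] - d[i + 1:, :])
        diff = np.pi - np.abs(diff - np.pi)   # periodic wrap-around per dimension
        s = np.sqrt(np.einsum('ij,ij->i', diff, diff))
        h += np.histogram(s, range=(0, mm), bins=bins)[0]
    return h

def threaded_hist(d, mm, bins, n_threads=4):
    chunks = np.array_split(np.arange(d.shape[0]), n_threads)
    with ThreadPoolExecutor(max_workers=n_threads) as ex:
        parts = ex.map(partial_hist, chunks,
                       [d] * n_threads, [mm] * n_threads, [bins] * n_threads)
        return sum(parts)

# hist = threaded_hist(d, mm, bins)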