Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 4 years ago.
Improve this question
I will state up front, I am not a Data Scientist, but have the wherewithal to learn what I need to know. However, I need advice as to where to look and the appropriate algorithms to study.
The problem as as follows. I have 10 years of 1 hour observations of output from a sensor. For the argument, let's use the output of a Weather Station and more specifically, a solar panel, in the form of a float in milliVolts.
You might argue that if a 24 Hour subset of data from this Time Series (24 Points) were taken as a matrix for comparison to the historical Time Series, one could identify "sunny" days in the past. If we were to take the latest 24 hrs of data as a comparison, we might be able to identify days that were "similar" to today and thereby taking the next subsequent matrix from a matched position, "predict" what is going to happen tomorrow, from historical action.
This is of course a rough analogy, but illustrates my problem.
I wish to take an arbitrary 24 hr period from the Time Series (Lets call this Matrix a) and identify from the Time Series (000s of Matrices) those 24 hr periods that are similar.
I have reviewed a lot around this subject in the form of various types of Regression and at one stage identified that the Data Compression algorithms would be the most effective, if you could source the subsequent dictionary made from the process, however, I realised the matching in this case is "exact" and I wish for "similar".
I have settled on what I believe to be correct, "L1 Penalty and Sparsity in Logistic Regression" located here.
Where I (if I understand correctly) take a comparison Matrix, compare it to others and get a score for "similarity" (In this case called C). From here I can carry on with my experiment.
If some kind hearted Data Scientist might me a favor and 1. Confirm my direction effective or, if not 2. Point me to where I might find the process to answer my problem, I would be eternally grateful.
Many thanks in advance
ApteryxNZ
For timeseries forecasting (prediction), you can search about LSTM neural network, SVM and even MLP. I've seen timeseries forecasting with simpler classifiers, such as AODE.
To filter the data (if applicable) that you will input to your timeseries you can search about Granger causality, Particle Sworm optimizations and even genetic algorithms
For finding similar patterns in the timeseries, I think your best option is using Dynamic Time Warping (DTW) used for speech recognition
You can search about related work in some journals such as:
Pattern Recognition Letters
Pattern Recognition
Neurocomputing
Applied Soft Computing
Information Sciences
Machine Learning
Neural Networks
IEEE Transaction on Neural Networks and Learning Systems
Note that this really depends how you define "similar".
One simple way would be the "nearest neighbors" approach: treat your data points as 24-dimensional vectors, then find the ones with the shortest Euclidean (or Manhattan or…) distance to your goal point. Those are the most similar days. (k-d trees can speed up this process significantly.)
But, 24 dimensions might be too much for your purposes. Principal Component Analysis (PCA) can reduce them from 24-dimensional points to some lower number of dimensions, while preserving the variation as much as possible. Then finding the closest points will be much faster.
Note that both of these methods will only work if you're comparing value-by-value, that is, if you don't consider "the same but one hour later" to be particularly similar.
Related
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 1 year ago.
Improve this question
I need to validate the correctness of heating/cooling cycle based on reading of temperature sensor in time.
Correct time series has a certain shape (number of ups and downs), lasts more or less the same amount of time, and has a certain max temperature which needs to be met during cycle.
Typically the process is faulty when it is compressed or extruded in time. Has too low temperatures at peaks or in general the heating/cooling envelope is messed up. On the above picture I posted a simplified example of proper and wrong cycles of the process.
What classifier you would recommend for supervised learning models? Is unsupervised model at all possible to be developed for such scenario?
I am currently using calculation of max value of Temperature and Cross Correlation of 1 master typical proper cycle vs the tested one, but I wonder if there is a better more generic way to tackle the problem.
imho machine learning is overengineering this problem, some banding and counting peaks seems to be the much easier approach to me.
Nontheless if you want machinelearning i would go with autoencoders for anomaly detection, examples can be found here or here.
Tl/dr:
The idea is an autoencoder reconstructs the input though a very small bottleneck (i.e one value, could be the phase) so any current point will construct a good looking curve. Then it gets compared to the actual curve. If it fits all is good, if it doesnt you know something is not right.
I received a feedback from my paper about stock market forecasting with Machine Learning, and the reviewer asked the following:
I would like you to statistically test the out-of-sample performance
of your methods. Hence 'differ significantly' in the original wording.
I agree that some of the figures look awesome visually, but visually,
random noise seems to contain patterns. I believe Sortino Ratio is the
appropriate statistic to test, and it can be tested by using
bootstrap. I.e., a distribution is obtained for both BH and your
strategy, and the overlap of these distributions is calculated.
My problem is that I never did that for time series data. My validation procedure is using a strategy called walk forward, where I shift data in time 11 times, generating 11 different combinations of training and test with no overlap. So, here are my questions:
1- what would be the best (or more appropriate) statistical test to use given what the reviewer is asking?
2- If I remember well, statistical tests require vectors as input, is that correct? can I generate a vector containing 11 values of sortino ratios (1 for each walk) and then compare them with baselines? or should I run my code more than once? I am afraid the last choice would be unfeasible given the sort time to review.
So, what would be the correct actions to compare machine learning approaches statistically in this time series scenario?
Pointing out random noise seems to contain patterns, It's mean your plots have nice patterns, but it's might be random noise following [x] distribution (i.e. random uniform noise), which make things less accurate. It might be a good idea to split data into a k groups randomly, then apply Z-Test or T-test, pairwise compare the k-groups.
The reviewer point out the Sortino ratio which seems to be ambiguous as you are targeting to have a machine learning model, for a forecasting task, it's meant that, what you actually care about is the forecasting accuracy and reliability which could be granted if you are using Cross-Vaildation, in convex optimization it's equivalent to use the sensitivity analysis.
Update
The problem of serial dependency for time series data, raised in case of we have non-stationary time series data (low patterns), which seems to be not the problem of your data, even if it's the case, it's could be solved by removing the trends, i.e. convert non-stationery time series into stationery, using ADF Test for example, and might also consider using ARIMA models.
Time shifting, sometimes could be useful, but it's not considered to be a good measurement of noises, but it's might help to improve model accuracy by shifting data and extracting some features (ex. mean, variance over window size, etc.).
There's nothing preventing you to try time shifting approach, but you can't rely on it as an accurate measurement and you still need to prove your statistical analysis, using more robust techniques.
I am implementing an anomaly detection system that will be used on different time series (one observation every 15 min for a total of 5 months). All these time series have a common pattern: high levels during working hours and low levels otherwise.
The idea presented in many papers is the following: build a model to predict future values and calculate an anomaly score based on the residuals.
What I have so far
I use an LSTM to predict the next time step given the previous 96 (1 day of observations) and then I calculate the anomaly score as the likelihood that the residuals come from one of the two normal distributions fitted on the residuals obtained with the validation test. I am using two different distributions, one for working hours and one for non working hours.
The model detects very well point anomalies, such as sudden falls and peaks, but it fails during holidays, for example.
If an holiday is during the week, I expect my model to detect more anomalies, because it's an unusual daily pattern wrt a normal working day.
But the predictions simply follows the previous observations.
My solution
Use a second and more lightweight model (based on time series decomposition) which is fed with daily aggregations instead of 15min aggregations to detect daily anomalies.
The question
This combination of two models allows me to have both anomalies and it works very well, but my idea was to use only one model because I expected the LSTM to be able to "learn" also the weekly pattern. Instead it strictly follows the previous time steps without taking into consideration that it is a working hour and the level should be much higher.
I tried to add exogenous variables to the input (hour of day, day of week), to add layers and number of cells, but the situation is not that better.
Any consideration is appreciated.
Thank you
A note on your current approach
Training with MSE is equivalent to optimizing the likelihood of your data under a Gaussian with fixed variance and mean given by your model. So you are already training an autoencoder, though you do not formulate it so.
About the things you do
You don't give the LSTM a chance
Since you provide data from last 24 hours only, the LSTM cannot possibly learn a weekly pattern.
It could at best learn that the value should be similar as it was 24 hours before (though it is very unlikely, see next point) -- and then you break it with Fri-Sat and Sun-Mon data. From the LSTM's point of view, your holiday 'anomaly' looks pretty much the same as the weekend data you were providing during the training.
So you would first need to provide longer contexts during learning (I assume that you carry the hidden state on during test time).
Even if you gave it a chance, it wouldn't care
Assuming that your data really follows a simple pattern -- high value during and only during working hours, plus some variations of smaller scale -- the LSTM doesn't need any long-term knowledge for most of the datapoints. Putting in all my human imagination, I can only envision the LSTM benefiting from long-term dependencies at the beginning of the working hours, so just for one or two samples out of the 96.
So even if the loss value at the points would like to backpropagate through > 7 * 96 timesteps to learn about your weekly pattern, there are 7*95 other loss terms that are likely to prevent the LSTM from deviating from the current local optimum.
Thus it may help to weight the samples at the beginning of working hours more, so that the respective loss can actually influence representations from far history.
Your solutions is a good thing
It is difficult to model sequences at multiple scales in a single model. Even you, as a human, need to "zoom out" to judge longer trends -- that's why all the Wall Street people have Month/Week/Day/Hour/... charts to watch their shares' prices on. Such multiscale modeling is especially difficult for an RNN, because it needs to process all the information, always, with the same weights.
If you really want on model to learn it all, you may have more success with deep feedforward architectures employing some sort of time-convolution, eg. TDNNs, Residual Memory Networks (Disclaimer: I'm one of the authors.), or the recent one-architecture-to-rule-them-all, WaveNet. As these have skip connections over longer temporal context and apply different transformations at different levels, they have better chances of discovering and exploiting such an unexpected long-term dependency.
There are implementations of WaveNet in Keras laying around on GitHub, e.g. 1 or 2. I did not play with them (I've actually moved away from Keras some time ago), but esp. the second one seems really easy, with the AtrousConvolution1D.
If you want to stay with RNNs, Clockwork RNN is probably the model to fit your needs.
About things you may want to consider for your problem
So are there two data distributions?
This one is a bit philosophical.
Your current approach shows that you have a very strong belief that there are two different setups: workhours and the rest. You're even OK with changing part of your model (the Gaussian) according to it.
So perhaps your data actually comes from two distributions and you should therefore train two models and switch between them as appropriate?
Given what you have told us, I would actually go for this one (to have a theoretically sound system). You cannot expect your LSTM to learn that there will be low values on Dec 25. Or that there is a deadline and this weekend consists purely of working hours.
Or are there two definitions of anomaly?
One philosophical point more. Perhaps you personally consider two different types of anomaly:
A weird temporal trajectory, unexpected peaks, oscillations, whatever is unusual in your domain. Your LSTM supposedly handles these already.
And then, there is different notion of anomaly: Value of certain bound in certain time intervals. Perhaps a simple linear regression / small MLP from time to value would do here?
Let the NN do all the work
Currently, you effectively model the distribution of your quantity in two steps: First, the LSTM provides the mean. Second, you supply the variance.
You might instead let your NN (together with additional 2 affine transformations) directly provide you with a complete Gaussian by producing its mean and variance; much like in Variational AutoEncoders (https://arxiv.org/pdf/1312.6114.pdf, appendix C.2). Then, you need to optimize directly the likelihood of your following sample under the NN-distribution, rather than just MSE between the sample and the NN output.
This will allow your model to tell you when it is very strict about the following value and when "any" sample will be OK.
Note, that you can take this approach further and have your NN produce "any" suitable distribution. E.g. if your data live in-/can be sensibly transformed to- a limited domain, you may try to produce a Categorical distribution over the space by having a Softmax on the output, much like WaveNet does (https://arxiv.org/pdf/1609.03499.pdf, Section 2.2).
Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 9 years ago.
Improve this question
I have a set of data in a .tsv file available here. I have written several classifiers to decide whether a given website is ephemeral or evergreen.
My initial practice was rapid prototyping, did a random classifier, 1R classifier, tried some feature engineering, linear regression, logistic regression, naive bayes...etc etc.
I did all of this in a jumbled-up, incoherent manner however. What I would like to know is, if you were given a set of data (for the sake of argument, the data posted above) how would you analyse it to find a suitable classifier? What would you look at to extract meaning from your dataset initially?
Is what I have done correct in this age of high-level programming where I can run 5/6 algorithms on my data in a night? Is a rapid prototyping approach the best idea here or is there a more reasoned, logical approach that can be taken?
At the moment, I have cleaned up the data, removing all the meaningless rows (there is a small amount of these so they can just be discarded). I have written a script to cross validate my classifier, so I have a metric to test for bias/variance and also to check overall algorithm performance.
Where do I go from here? What aspects do I need to consider? What do I think about here?
You could throw in some elements of theory. For example:
the naive bayes classifier assumes that all variables are independent. Maybe that's not the case?
But this classifier is fast and easy, so it's still a good choice for many problems, even if the variables are not really independent.
the linear regression gives too much weight on samples that are far away from the classification boundary. That's usually a bad idea.
the logistic regression is an attempt to fix this problem, but still assumes a linear correlation between the input variables. In other words, the boundary between the classes is a plane in the input variable space.
When I study a dataset, I typically start by drawing the distribution of each variable for each class of samples to find the most discriminating variables.
Then, for each class of samples, I usually plot a given input variable versus another to study the correlations between the variables: are there non-linear correlations? if yes, I might choose classifiers that can handle such correlations.
Are there strong correlations between two input variables?
if yes, one of the variables could be dropped to reduce the dimensionality of the problem.
These plots will also allow you to spot problems in your dataset.
But after all, trying many classifiers and optimizing their parameters for best results in the cross validation as you have done is a pragmatic and valid approach, and this has to be done at some point anyway.
I understand from the tags in this post that you have used the classifiers of scikit-learn.
In case you have not noticed yet, this package provides powerful tools for cross validation as well http://scikit-learn.org/stable/modules/cross_validation.html
I was analyzing some geographical data and attempting to predict/forecast next occurrence of event with respect to time and it geographical position. The data was in following order (with sample data)
Timestamp Latitude Longitude Event
13307266 102.86400972 70.64039541 "Event A"
13311695 102.8082912 70.47394645 "Event A"
13314940 102.82240522 70.6308513 "Event A"
13318949 102.83402128 70.64103035 "Event A"
13334397 102.84726242 70.66790352 "Event A"
First step was classifying it into 100 zones, so that reduces dimensions and complexity.
Timestamp Zone
13307266 47
13311695 65
13314940 51
13318949 46
13334397 26
Next step was to do time series analysis then I got stuck here for 2 months, read around a lot of literature and figured these were my options
* ARIMA (auto-regression method)
* Machine Learning
I wanted to utilize Machine learning to forecast using python but couldn't really figure out how.Specifically are there any python libraries/open-source-code specific for use case, which I can build upon.
EDIT 1:
To clarify, data is loosely dependent on past data but over a period of time is uniformly distributed.
The best way to visualize the data would be, to imagine N number of agents controlled by a algorithm which allots them task of picking resource from grids. Resources are function of socioeconomic structure of society and also strongly dependent on geography. Its in interest of " algorithm " to be able to predict demand zone and time wise.
p.s:
For Auto-regressive models like ARIMA Python already has a library http://pypi.python.org/pypi/statsmodels .
Without example data or existing code I can't offer you anything concrete.
However, often it's helpful to re-phrase your problem in the nomenclature of the field you want to explore. In ML terms:
Your problem's features: How your inputs are specified. Timestamp is continuous, geographic zone is discrete.
Your problems's target label: an event, precisely whether or not a given event has occurred.
Your problem is supervised: target labels for previous data are available. You have previous instances of (timestamp, geographic zone) to event mappings.
The target label is discrete, so this is a classification problem (as opposed to a regression problem, where the output is continuous).
So I'd say you have a supervised classification problem. As an aside you may want to do some sort of time regularisation first; I'm guessing there are going to be patterns of the events depending on what time of the day, day of the month, or month of the year it is, and you may want to represent this as an additional feature.
Taking a look at one of the popular Python ML libraries available, scikit-learn, here:
http://scikit-learn.org/stable/supervised_learning.html
and consulting a recent posting on a cheatsheet for scikit-learn by one of the contributors:
http://peekaboo-vision.blogspot.de/2013/01/machine-learning-cheat-sheet-for-scikit.html
Your first good bet would be to try Support Vector Machines (SVM), and if that fails maybe give k Nearest Neighbours (kNN) a shot as well. Note that using an ensemble classifier is usually superior than using just one instance of a given SVM/kNN.
How, exactly, to apply SVM/kNN with time as a feature may require more research, since AFAIK (and others will probably correct me) SVM/kNN require bounded inputs with a mean of zero (or normalised to have a mean of zero). Just doing some random Googling you may be able to find certain SVM kernels, for example a Fourier kernel, that can transform a time-series feature for you:
SVM Kernels for Time Series Analysis
http://www.stefan-rueping.de/publications/rueping-2001-a.pdf
scikit-learn handily allows you to specify a custom kernel for an SVM. See:
http://scikit-learn.org/stable/auto_examples/svm/plot_custom_kernel.html#example-svm-plot-custom-kernel-py
With your knowledge of ML nomenclature, and example data in hand, you may want to consider posting the question to Cross Validated, the statistics Stack Exchange.
EDIT 1: Thinking about this problem more you need to really understand if your features and corresponding labels are independent and identically distributed (IID) or not. For example what if you were modelling how forest fires spread over time. It's clear that the likelihood of a given zone catches fire is contingent on its neighbours being on fire or not. AFAIK SVM and kNN assume the data is IID. At this point I'm starting to get out of my depth, but I think you should at least give several ML methods a shot and see what happens! Remember to cross-validate! (scikit-learn does this for you).