I am trying to write Python code that detects anomalies in time series data. My input data looks something like this:
Here, the regions marked in red are anomalies. I want to get the x-coordinates of the data points that are anomalous. So far I have tried a basic if condition (i.e. if rate < 100, the data point is anomalous) and various statistical techniques such as the mean, standard deviation, and rolling averages with different window sizes, but none of them have worked well. Is there a way to achieve what I want using statistical methods alone? If there is no simple way to do this, I understand that I have to look at machine learning algorithms. In that case, which algorithm would be suitable for my dataset? Thank you.
It looks as if your data comes in lumps. If you are able to distinguish between the lumps (perhaps by a certain delay between two samples), you can look at the distribution of the samples within each lump. If you know that your rate should never drop below 100, I would start with that rule to clean the data a bit, then look at the remaining distribution. The mode should help identify the "middle", most frequently occurring value. Cutting off everything beyond a certain number of standard deviations from there may give you clean data, but there is no guarantee that you won't also cut off some of the data you want to keep.
Edit: you'd have to bin your data before getting the mode.
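A minimal sketch of that recipe (the threshold, bin count and rate values below are made-up placeholders, not from the question):

import numpy as np
import pandas as pd

# toy series: the index is the x-coordinate, the values are the rate
rate = pd.Series(np.random.normal(500, 20, 1000))
rate.iloc[[100, 400, 800]] = 50           # inject a few artificial anomalies

clean = rate[rate >= 100]                 # hard floor mentioned above
counts, edges = np.histogram(clean, bins=50)
mode_est = edges[np.argmax(counts)]       # binned mode as the "middle" value

k = 3                                     # cut-off in standard deviations
sd = clean.std()
anomalous_x = rate.index[(rate - mode_est).abs() > k * sd]
print(anomalous_x.tolist())               # x-coordinates flagged as anomalous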
I received feedback on my paper about stock market forecasting with machine learning, and the reviewer asked the following:
I would like you to statistically test the out-of-sample performance
of your methods. Hence 'differ significantly' in the original wording.
I agree that some of the figures look awesome visually, but visually,
random noise seems to contain patterns. I believe Sortino Ratio is the
appropriate statistic to test, and it can be tested by using
bootstrap. I.e., a distribution is obtained for both BH and your
strategy, and the overlap of these distributions is calculated.
My problem is that I have never done that for time series data. My validation procedure uses a strategy called walk-forward, where I shift the data in time 11 times, generating 11 different training/test combinations with no overlap. So, here are my questions:
1 - What would be the best (or most appropriate) statistical test to use, given what the reviewer is asking?
2 - If I remember correctly, statistical tests require vectors as input; is that correct? Can I generate a vector containing 11 Sortino ratio values (one for each walk) and then compare it with the baselines, or should I run my code more than once? I am afraid the latter would be unfeasible given the short time allowed for the review.
So, what would be the correct way to statistically compare machine learning approaches in this time series scenario?
By pointing out that random noise seems to contain patterns, the reviewer means that your plots may look like nice patterns but could just be random noise following some distribution (e.g. uniform random noise), which makes the visual evidence less convincing. It might be a good idea to split the data into k groups at random and then apply a Z-test or t-test, comparing the k groups pairwise.
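As an illustration of that comparison on the 11 walk-forward Sortino ratios, a paired test could look roughly like this (the numbers are invented, not your results):

import numpy as np
from scipy import stats

# one Sortino ratio per walk-forward split, for the strategy and for buy-and-hold (BH)
strategy = np.array([1.8, 2.1, 1.5, 1.9, 2.3, 1.7, 2.0, 1.6, 2.2, 1.9, 1.8])
baseline = np.array([1.2, 1.4, 1.1, 1.3, 1.5, 1.0, 1.4, 1.2, 1.6, 1.3, 1.1])

# paired test: the same 11 splits are evaluated under both methods
t_stat, p_value = stats.ttest_rel(strategy, baseline)
print("paired t-test:", t_stat, p_value)

# non-parametric alternative, safer with only 11 values
w_stat, p_wilcoxon = stats.wilcoxon(strategy, baseline)
print("Wilcoxon signed-rank:", w_stat, p_wilcoxon)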
The reviewer points to the Sortino ratio, which seems ambiguous here: since you are building a machine learning model for a forecasting task, what you actually care about is forecasting accuracy and reliability, which can be assessed with cross-validation (in convex optimization the equivalent tool would be sensitivity analysis).
Update
The problem of serial dependence in time series data arises when the series is non-stationary (weak patterns), which does not seem to be the case for your data. Even if it were, it could be addressed by removing the trend, i.e. converting the non-stationary series into a stationary one, checking stationarity with an ADF test for example, and perhaps also considering ARIMA models.
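For reference, the ADF test mentioned above can be run with statsmodels roughly like this (the toy series and the differencing step are just illustrative):

import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

prices = pd.Series(np.cumsum(np.random.normal(0, 1, 500)))   # toy non-stationary series

adf_stat, p_value, *_ = adfuller(prices)
print("raw series p-value:", p_value)          # large p-value -> cannot reject a unit root

returns = prices.diff().dropna()               # first differencing removes the trend
adf_stat, p_value, *_ = adfuller(returns)
print("differenced series p-value:", p_value)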
Time shifting can sometimes be useful, but it is not a good measure of noise in itself; it may, however, help improve model accuracy when you shift the data and extract features from it (e.g. the mean and variance over a window size; see the sketch below).
There is nothing preventing you from trying the time-shifting approach, but you cannot rely on it as an accurate measurement, and you will still need to support your claims with statistical analysis using more robust techniques.
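If you do extract such window features, a minimal pandas sketch might look like this (the window size and column names are arbitrary choices):

import numpy as np
import pandas as pd

series = pd.Series(np.random.normal(0, 1, 200), name="returns")

window = 20   # arbitrary; tune to your data
features = pd.DataFrame({
    "rolling_mean": series.rolling(window).mean(),
    "rolling_var": series.rolling(window).var(),
    "lag_1": series.shift(1),                    # a simple time shift used as a feature
}).dropna()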
In a data science task I have some physical data from an instrument and need to predict a continuous time value. The data is divided into signal samples, with some peaks occurring before that target time. In order to create new features I will have to use some statistical information about the signal, but not necessarily for the whole signal sample.
I was thinking about dividing the sample into chunks and using statistical data derived from these chunks as separate features.
I could divide the sample into, say, 1000 chunks, but such a fixed division may not make much sense. Maybe it would be better to get statistical info from the first 10% of the sample, then, say, the last 20%, and so on, or at least to choose the division based on the specific sample. Maybe for some samples dividing into 1000 chunks is good, but for others it should be 500 or 2000, etc.
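For illustration, per-chunk statistics for one signal sample could be built roughly like this, where the number of chunks is exactly the parameter in question (this is just a sketch with a toy signal):

import numpy as np

def chunk_features(sample, n_chunks):
    """Split one signal sample into n_chunks pieces and return the
    per-chunk mean and standard deviation as one flat feature vector."""
    chunks = np.array_split(sample, n_chunks)
    feats = []
    for c in chunks:
        feats.append(c.mean())
        feats.append(c.std())
    return np.array(feats)

sample = np.random.normal(0, 1, 10_000)          # toy stand-in for one signal sample
x = chunk_features(sample, n_chunks=1000)        # 2000 features for this sample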
My idea was to use a neural network to derive that division value (or maybe a few values, such as the number of chunks and their sizes).
Does that make sense at all, and if so, any ideas on how to do it? It sounds like parameter optimisation using a neural network, but googling for that didn't give me the result I needed.
Maybe someone has stumbled upon a similar problem?
I have some 2D data:
The data is labeled and shown in different colors. An unsupervised approach will definitely not yield correct predictions because the data is quite mixed (although the colors do seem to have regions of preference). I want to see if it is possible to measure how mixed the points from the different sets are.
For this I need to define a measure of how mixed they are (I think such a measure should already exist), and it would be nice to have these algorithms implemented somewhere. I am also looking for a simple predictive model that can be trained using the data shown. Thanks for your help. If possible, I'm looking for these implementations in Python.
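One common way to quantify this kind of mixing (by no means the only one) is the silhouette score, or the cross-validated accuracy of a simple classifier; a rough scikit-learn sketch with made-up data in place of the plotted points:

import numpy as np
from sklearn.metrics import silhouette_score
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# placeholder data: X is (n_samples, 2), y holds the colour labels
X = np.random.normal(0, 1, size=(300, 2))
y = np.random.randint(0, 3, size=300)

# silhouette near 0 (or negative) means the labelled groups are heavily mixed
print("silhouette:", silhouette_score(X, y))

# cross-validated accuracy near chance level is another sign the classes overlap
clf = KNeighborsClassifier(n_neighbors=15)
print("CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())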
I have a pandas DataFrame whose index is unique user identifiers, columns corresponding to unique events, and values 1 (attended), 0 (did not attend), or NaN (wasn't invited/not relevant). The matrix is pretty sparse with respect to NaNs: there are several hundred events and most users were only invited to several tens at most.
I created some extra columns to measure the "success" which I define as just % attended relative to invites:
my_data['invited'] = my_data.count(axis=1)                      # non-NaN entries per user = number of invites
my_data['attended'] = my_data.sum(axis=1) - my_data['invited']  # the row sum now includes 'invited', so subtract it back out
my_data['success'] = my_data['attended'] / my_data['invited']   # attendance rate per user
Assume the following is true: the success data should be normally distributed with mean 0.80 and s.d. 0.10. When I look at the histogram of my_data['success'], it is not normal and is skewed left. Whether this is true in reality is not important; I just want to solve the technical problem I pose below.
So this is my problem: there are some events which I don't think are "good" in the sense that they are making the success data diverge from normal. I'd like to do "feature selection" on my events to pick a subset of them which makes the distribution of my_data['success'] as close to normal as possible in the sense of "convergence in distribution".
I looked at the scikit-learn "feature selection" methods here and the "Univariate feature selection" seems like it makes sense. But I'm very new to both pandas and scikit-learn and could really use help on how to actually implement this in code.
Constraints: I need to keep at least half the original events.
Any help would be greatly appreciated. Please share as many details as you can, I am very new to these libraries and would love to see how to do this with my DataFrame.
Thanks!
EDIT: After looking some more at the scikit-learn feature selection approaches, "Recursive feature elimination" seems like it might make sense here too, but I'm not sure how to build it up with my "accuracy" metric being "close to normally distributed with mean..."
Keep in mind that feature selection is about selecting features, not samples, i.e., (typically) the columns of your DataFrame, not the rows. So I am not sure that feature selection is what you want: as I understand it, you want to remove the samples that cause the skew in your distribution?
Also, what about feature scaling, e.g., standardization, so that your data is rescaled to mean = 0 and sd = 1?
The equation is simply z = (x - mean) / sd
To apply it to your DataFrame, you can simply do
my_data['success'] = (my_data['success'] - my_data['success'].mean(axis=0)) / (my_data['success'].std(axis=0))
However, don't forget to keep the mean and SD parameters to transform your test data, too. Alternatively, you could also use the StandardScaler from scikit-learn.
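A short sketch of the scikit-learn route, which keeps the training mean and SD around for the test data automatically (the train/test split and the toy values are only illustrative):

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# toy stand-in for my_data[['success']], split into train and test rows
success = pd.DataFrame({"success": np.random.beta(8, 2, size=500)})
train, test = success.iloc[:400], success.iloc[400:]

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)   # learns the mean and SD from the training rows
test_scaled = scaler.transform(test)         # reuses the same mean and SD on the test rows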
My Question is as follows:
I know a little bit about ML in Python (using NLTK), and it has worked OK so far: I can get predictions given certain features. But I want to know whether there is a way to display the best feature values for achieving a given label? I mean the direct opposite of what I've been doing so far (put in all the circumstances and get a label for them).
I'll try to make my question clear with an example:
Let's say I have a database with Soccer games.
The Labels are e.g. 'Win', 'Loss', 'Draw'.
The Features are e.g. 'Windspeed', 'Rain or not', 'Daytime', 'Fouls committed' etc.
Now I want to know: Under which circumstances will a Team achieve a Win, Loss or Draw? Basically I want to get back something like this:
Best conditions for Win: Windspeed=0, No Rain, Afternoon, Fouls=0 etc
Best conditions for Loss: ...
Is there a way to achieve this?
My paint skills aren't the best!
All I know is the theory, so you'll have to look elsewhere for the code.
If you have only one case (the best point for each situation), the diagram becomes something like this (it won't really be 2-D, but something like this):
Green (Win), Orange (Draw), Red (Lose)
Now, if you want to predict whether the team wins, loses or draws, you have (at least) two simple models to classify with:
A linear separator: the boundary is the perpendicular bisector of the line joining the two benchmark points (effectively a nearest-centroid rule rather than linear regression):
K-nearest neighbours: calculate the distance of a point from all the known points and give it the same label as the closest one(s).
So, for example, if you have a new data point and have to classify it, here's how:
We have a new point with certain attribute values.
We classify it by calculating which side of the line the point falls on (or by seeing how far it is from each of our benchmark situations).
Note: you will have to give some weight to each factor for better accuracy.
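That said, a rough scikit-learn sketch of the nearest-neighbour idea might look like this (the feature rows and match outcomes are invented, purely to show the shape of the data):

from sklearn.neighbors import KNeighborsClassifier

# each row: [windspeed, rain (0/1), hour of day, fouls committed] -- invented values
X = [[0, 0, 15, 0],
     [25, 1, 20, 12],
     [10, 0, 18, 5],
     [30, 1, 21, 15],
     [5, 0, 14, 2]]
y = ["Win", "Loss", "Draw", "Loss", "Win"]

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X, y)

# classify a new match by its nearest neighbours among the past matches
print(clf.predict([[2, 0, 16, 1]]))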
You could compute how representative each feature is for separating the classes via feature weighting. The most common method for feature selection (and therefore feature weighting) in text classification is chi^2 (chi-squared). This measure will tell you which features discriminate best between the labels. Based on this information you can then analyse the specific feature values that work best for every case. I hope this helps.
Regards,
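For a concrete starting point, a small chi^2 scoring sketch with scikit-learn (the toy feature matrix and labels are made up; note that chi^2 expects non-negative feature values):

import numpy as np
from sklearn.feature_selection import chi2

# columns: windspeed, rain (0/1), fouls committed -- all non-negative
X = np.array([[0, 0, 0],
              [25, 1, 12],
              [10, 0, 5],
              [30, 1, 15],
              [5, 0, 2],
              [28, 1, 14]])
y = np.array(["Win", "Loss", "Draw", "Loss", "Win", "Loss"])

scores, p_values = chi2(X, y)
for name, score in zip(["windspeed", "rain", "fouls"], scores):
    print(name, score)       # higher score = the feature separates the labels better

# typical feature values for one label, e.g. wins
print(X[y == "Win"].mean(axis=0))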
Not sure if you have to do this in python, but if not, I would suggest Weka. If you're unfamiliar with it, here's a link to a set of tutorials: https://www.youtube.com/watch?v=gd5HwYYOz2U
Basically, you'd just need to write a program to extract your features and labels and then output a .arff file. Once you've generated a .arff file, you can feed this to Weka and run myriad different classifiers on it to figure out what model best fits your data. If necessary, you can then program this model to operate on your data. Weka has plenty of ways to analyze your results and to graphically display said results. It's truly amazing.
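For reference, writing the .arff file by hand is straightforward; a minimal sketch with invented soccer attributes (names and values are placeholders):

# invented example rows: (windspeed, rain, daytime hour, fouls, result)
rows = [
    (0.0, "no", 14, 0, "Win"),
    (25.0, "yes", 20, 12, "Loss"),
    (10.0, "no", 18, 5, "Draw"),
]

with open("matches.arff", "w") as f:
    f.write("@relation soccer\n")
    f.write("@attribute windspeed numeric\n")
    f.write("@attribute rain {yes,no}\n")
    f.write("@attribute daytime numeric\n")
    f.write("@attribute fouls numeric\n")
    f.write("@attribute result {Win,Loss,Draw}\n")
    f.write("@data\n")
    for r in rows:
        f.write(",".join(str(v) for v in r) + "\n")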