How to verify proper shape of time series with ML [closed] - python

I need to validate the correctness of a heating/cooling cycle based on readings from a temperature sensor over time.
A correct time series has a certain shape (a certain number of ups and downs), lasts roughly the same amount of time, and reaches a certain maximum temperature at some point during the cycle.
Typically the process is faulty when it is compressed or stretched in time, when the temperatures at the peaks are too low, or when the heating/cooling envelope is otherwise distorted. In the picture I posted there is a simplified example of proper and faulty cycles of the process.
What classifier would you recommend for a supervised learning model? Is an unsupervised model at all feasible for such a scenario?
I am currently computing the maximum temperature and the cross-correlation of one master "known good" cycle against the tested one, but I wonder if there is a better, more generic way to tackle the problem.

IMHO machine learning is over-engineering this problem; some banding and peak counting seems like the much easier approach to me.
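A minimal sketch of that simpler rule-based idea, assuming SciPy is available and each cycle is a 1-D NumPy array of temperatures sampled at a fixed rate (the thresholds below are made-up placeholders you would tune to your process):

```python
import numpy as np
from scipy.signal import find_peaks

def cycle_looks_ok(temps, expected_peaks=3, min_peak_temp=180.0,
                   min_samples=500, max_samples=700):
    """Rule-based check: peak count, overall duration and max temperature."""
    peaks, _ = find_peaks(temps, height=min_peak_temp)    # peaks above the required band
    return (len(peaks) == expected_peaks                  # right number of ups/downs
            and min_samples <= len(temps) <= max_samples  # not compressed/stretched in time
            and temps.max() >= min_peak_temp)             # required max temperature reached
```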
Nonetheless, if you want machine learning, I would go with autoencoders for anomaly detection; examples can be found here or here.
TL;DR:
The idea is that an autoencoder reconstructs the input through a very small bottleneck (e.g. a single value, which could be the phase), so any input is reconstructed as a good-looking curve. The reconstruction is then compared to the actual curve: if it fits, all is good; if it doesn't, you know something is not right.
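A minimal sketch of that autoencoder idea, assuming TensorFlow/Keras and cycles resampled to a fixed length (the layer sizes, epoch count and error threshold are illustrative guesses, not tuned values): train only on known-good cycles, then flag cycles whose reconstruction error is high.

```python
import numpy as np
import tensorflow as tf

CYCLE_LEN = 100                                   # assumed fixed length after resampling
good_cycles = np.random.rand(500, CYCLE_LEN)      # placeholder for real, normalised good cycles

autoencoder = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(CYCLE_LEN,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(2, activation="relu"),  # very small bottleneck
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(CYCLE_LEN),
])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(good_cycles, good_cycles, epochs=50, verbose=0)

def is_anomalous(cycle, threshold=0.01):
    """Flag a cycle whose reconstruction error exceeds a tuned threshold."""
    recon = autoencoder.predict(cycle[None, :], verbose=0)[0]
    return float(np.mean((recon - cycle) ** 2)) > threshold
```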

Related

Controlled Variables in Logistic Regression in Python [closed]

I just learned what controlled variables mean for a project I am doing, and I was trying to find out whether scikit-learn has a controlled-variable option. Specifically, does Python support controlled variables (as opposed to independent variables) in logistic regression?
I googled around and found nothing for Python. However, I was thinking of something more basic: that controlling for a variable means stratifying by the group you are interested in (say, race) and then running the analysis on each group based on your x's and y. If this is correct, am I then supposed to interpret the results from those stratified groups separately?
Sorry for asking two questions, but I am trying to gather as much information as I can on this controlled-variable idea and how to apply it in Python.
As you may know, control variables are variables that the experimenter is not interested in studying but believes have a significant influence on the value the dependent variable takes. So people generally hold the value of such a variable constant when they run their experiments, i.e. while collecting data.
To give an example, assume you are trying to model the health condition of a person, i.e. classify whether he or she is healthy, and you are considering age, gender and exercise pattern as inputs to your model, and you want to study how each input affects your target variable. But you know very well that the country in which the subject resides will also have a say in his or her health condition (it encodes climate, health facilities, etc.). So, to make sure that this variable (country) does not affect your model, you collect all your data from just one country.
Answering your first question: no, scikit-learn does not have an option for controlled variables. It simply assumes that all the input variables you feed in are of interest to the experimenter.
Coming to your second question, one way of handling control variables is to first group the data by the control variable, so that each group has a constant value for it; we then run logistic regression (or any model) on each group separately and 'pool' the results from the different models. This approach falls apart if the number of levels in your control variable is very high, in which case we generally treat the control variable as an independent variable and feed it to the model.
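A minimal sketch of that grouping approach with scikit-learn (the file name and column names are hypothetical, and the features are assumed to already be numeric):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("health.csv")            # hypothetical data set
features = ["age", "gender", "exercise"]  # assumed to be numeric/encoded already

models = {}
for country, group in df.groupby("country"):    # each group holds the control variable constant
    model = LogisticRegression(max_iter=1000)
    model.fit(group[features], group["healthy"])
    models[country] = model                     # interpret/pool coefficients per group
```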
For more details please refer to 1 or 2; they have some really nice explanations.

Pattern Recognition Challenge [closed]

I will state up front that I am not a data scientist, but I have the wherewithal to learn what I need to know. However, I need advice on where to look and which algorithms to study.
The problem is as follows. I have 10 years of hourly observations of output from a sensor. For the sake of argument, let's say it is the output of a weather station, more specifically a solar panel, in the form of a float in millivolts.
You might argue that if a 24-hour subset of this time series (24 points) were taken as a matrix and compared against the historical time series, one could identify "sunny" days in the past. If we took the latest 24 hours of data as the comparison, we might be able to identify days that were "similar" to today, and then, by taking the next subsequent matrix from a matched position, "predict" what is going to happen tomorrow from historical behaviour.
This is of course a rough analogy, but it illustrates my problem.
I wish to take an arbitrary 24-hour period from the time series (let's call this matrix A) and identify, from the time series (thousands of matrices), those 24-hour periods that are similar.
I have reviewed a lot around this subject, in the form of various types of regression, and at one stage decided that data compression algorithms would be the most effective, if one could access the dictionary built during the process; however, I realised the matching in that case is "exact", whereas I want "similar".
I have settled on what I believe to be correct, "L1 Penalty and Sparsity in Logistic Regression", located here.
There, if I understand correctly, I take a comparison matrix, compare it to others and get a score for "similarity" (in this case called C). From here I can carry on with my experiment.
If some kind-hearted data scientist might do me a favour and 1. confirm my direction is effective or, if not, 2. point me to where I might find the process to answer my problem, I would be eternally grateful.
Many thanks in advance
ApteryxNZ
For time series forecasting (prediction), you can look into LSTM neural networks, SVMs and even MLPs. I've seen time series forecasting done with simpler classifiers, such as AODE.
To select which data (if applicable) you feed into your time series model, you can look into Granger causality, Particle Swarm Optimization and even genetic algorithms.
For finding similar patterns in the time series, I think your best option is Dynamic Time Warping (DTW), as used in speech recognition (see the sketch after the journal list below).
You can look for related work in journals such as:
Pattern Recognition Letters
Pattern Recognition
Neurocomputing
Applied Soft Computing
Information Sciences
Machine Learning
Neural Networks
IEEE Transaction on Neural Networks and Learning Systems
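A minimal sketch of a DTW similarity search, assuming the third-party fastdtw package is installed and the observations have been reshaped into 24-hour windows (the data here is a random placeholder):

```python
import numpy as np
from fastdtw import fastdtw

days = np.random.rand(3650, 24)   # placeholder: ~10 years of 24-hour profiles
query = days[-1]                  # e.g. the latest 24 hours

# DTW distance tolerates small shifts/stretches in time, unlike plain value-by-value distance
distances = [fastdtw(query, day, dist=lambda a, b: abs(a - b))[0] for day in days[:-1]]
most_similar = np.argsort(distances)[:10]
print("Indices of the 10 most similar historical days:", most_similar)
```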
Note that this really depends on how you define "similar".
One simple way would be the "nearest neighbors" approach: treat your data points as 24-dimensional vectors, then find the ones with the shortest Euclidean (or Manhattan or…) distance to your goal point. Those are the most similar days. (k-d trees can speed up this process significantly.)
But, 24 dimensions might be too much for your purposes. Principal Component Analysis (PCA) can reduce them from 24-dimensional points to some lower number of dimensions, while preserving the variation as much as possible. Then finding the closest points will be much faster.
Note that both of these methods will only work if you're comparing value-by-value, that is, if you don't consider "the same but one hour later" to be particularly similar.
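A minimal sketch of the nearest-neighbour and PCA ideas with scikit-learn (the array of daily profiles is a random placeholder; the number of components and neighbours are arbitrary):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

days = np.random.rand(3650, 24)                     # placeholder: one row per 24-hour day
reduced = PCA(n_components=5).fit_transform(days)   # compress 24 dims while keeping most variance

nn = NearestNeighbors(n_neighbors=10).fit(reduced)
_, idx = nn.kneighbors(reduced[-1:])                # query with the latest day
# the query day itself comes back first with distance 0; the rest are the most similar days
print("Indices of the most similar historical days:", idx[0])
```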

Vehicle counting based on classification (cars, buses, trucks)

I am currently working on vehicle platooning, for which I need to write code in Python with OpenCV to count the number of vehicles by class. The input is a real-time traffic video.
The idea is to find an average bounding-box size "x" and say that for cars it is "x", for buses it is "3x", and so on. Based on the size "x" or multiples of "x", the classification is determined. Is there any way I can approach this problem?
Haar cascades are a good method; however, training them takes a lot of time and effort.
You can find many pre-trained cascade files online.
A second approach is to extract contours from the image and proceed from there (a sketch follows the steps below):
- Start from the original image.
- Smooth the image so that you get a copy without edges.
- Subtract (original image - smoothed image) to obtain the edges.
- Extract contours from the edge image.
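A minimal sketch of those steps with OpenCV (assuming OpenCV 4, a single frame read from the video, and a made-up reference area `x_area` for an average car's bounding box):

```python
import cv2

frame = cv2.imread("frame.png")                       # placeholder for one video frame
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
smooth = cv2.GaussianBlur(gray, (21, 21), 0)          # heavily smoothed copy (edges removed)
edges = cv2.absdiff(gray, smooth)                     # original - smooth leaves the edges
_, mask = cv2.threshold(edges, 25, 255, cv2.THRESH_BINARY)
contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

x_area = 5000                                         # assumed bounding-box area of one car, in pixels
counts = {"car": 0, "bus_or_truck": 0}
for c in contours:
    _, _, w, h = cv2.boundingRect(c)
    area = w * h
    if area < 0.5 * x_area:
        continue                                      # too small to be a vehicle
    counts["car" if area < 2 * x_area else "bus_or_truck"] += 1
print(counts)
```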
I have worked on an almost identical problem.
The easiest way is to train a Haar cascade on vehicles of similar size.
You will have to train multiple cascades, one per category.
Data for the cascades can be downloaded from any used-car selling site using a browser plugin.
The negative sets depend very much on the context in which this solution will be used.
This also raises the issue that, if you plan to do this on a busy street, there will be many unforeseen scenarios, for example pedestrians walking through the field of view (FoV). The FoV also needs to be fixed, especially the distance from which objects are observed. Trial and error is the only way to find the sweet spot for the thresholds, if any.
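For reference, a minimal usage sketch of a pre-trained cascade with OpenCV (the file "cars.xml" is a hypothetical cascade downloaded or trained separately; you would load one such file per vehicle category):

```python
import cv2

car_cascade = cv2.CascadeClassifier("cars.xml")   # hypothetical pre-trained cascade
frame = cv2.imread("frame.png")                   # placeholder for one video frame
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
cars = car_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=3)
print("cars detected:", len(cars))
```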
Now I am going to suggest something outside the scope of the question you asked.
Though yours is a purely image-processing-based approach, you can turn the problem on its head and ask why classification is needed at all. Depending on the use case, more often than not it will be possible to train a deep reinforcement learning agent, which can solve the problem without a lot of manual work.
Let me know in case of any specific issues.

What's your rule of thumb for initially selecting a machine learning algorithm/doing your initial setup? [closed]

I have a set of data in a .tsv file available here. I have written several classifiers to decide whether a given website is ephemeral or evergreen.
My initial approach was rapid prototyping: I wrote a random classifier and a 1R classifier, tried some feature engineering, linear regression, logistic regression, naive Bayes, and so on.
However, I did all of this in a jumbled-up, incoherent manner. What I would like to know is: if you were given a set of data (for the sake of argument, the data posted above), how would you analyse it to find a suitable classifier? What would you look at initially to extract meaning from the dataset?
Is what I have done reasonable in this age of high-level programming, where I can run 5 or 6 algorithms on my data in a night? Is a rapid-prototyping approach the best idea here, or is there a more reasoned, logical approach that can be taken?
At the moment, I have cleaned up the data, removing all the meaningless rows (there are only a few of these, so they can just be discarded). I have written a script to cross-validate my classifiers, so I have a metric to check for bias/variance and to measure overall algorithm performance.
Where do I go from here? What aspects do I need to consider? What should I be thinking about?
You could throw in some elements of theory. For example:
- The naive Bayes classifier assumes that all variables are independent. Maybe that's not the case? Still, this classifier is fast and easy, so it's a good choice for many problems even if the variables are not really independent.
- Linear regression gives too much weight to samples that are far away from the classification boundary, which is usually a bad idea.
- Logistic regression is an attempt to fix this problem, but it still assumes a linear relationship between the input variables and the outcome; in other words, the boundary between the classes is a plane in the input-variable space.
When I study a dataset, I typically start by drawing the distribution of each variable for each class of samples to find the most discriminating variables.
Then, for each class of samples, I usually plot one input variable versus another to study the correlations between the variables. Are there non-linear correlations? If yes, I might choose classifiers that can handle them.
Are there strong correlations between two input variables? If yes, one of the variables could be dropped to reduce the dimensionality of the problem.
These plots will also allow you to spot problems in your dataset.
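A minimal sketch of those per-class distribution plots with pandas and matplotlib (the DataFrame below is a random stand-in for the cleaned .tsv data):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# stand-in for the cleaned data: two numeric features and a binary class label
df = pd.DataFrame({
    "feature_a": np.random.randn(500),
    "feature_b": np.random.randn(500),
    "label": np.random.randint(0, 2, 500),
})

for col in df.columns.drop("label"):
    for cls, group in df.groupby("label"):
        plt.hist(group[col], bins=30, alpha=0.5, label=f"class {cls}")
    plt.title(col)       # one figure per input variable, overlaying the class distributions
    plt.legend()
    plt.show()
```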
In the end, trying many classifiers and optimizing their parameters for the best cross-validation results, as you have done, is a pragmatic and valid approach, and it has to be done at some point anyway.
I understand from the tags in this post that you have been using the classifiers from scikit-learn.
In case you have not noticed yet, this package also provides powerful tools for cross-validation: http://scikit-learn.org/stable/modules/cross_validation.html
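A minimal sketch of those cross-validation tools (synthetic data stands in for the .tsv features and labels; the two classifiers are just examples from the ones already tried):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)  # stand-in data

for clf in (LogisticRegression(max_iter=1000), GaussianNB()):
    scores = cross_val_score(clf, X, y, cv=5)       # 5-fold cross-validation accuracy
    print(type(clf).__name__, scores.mean().round(3), "+/-", scores.std().round(3))
```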

Artificial life with neural networks [closed]

I am trying to build a simple evolution simulation of agents controlled by neural networks. In the current version each agent has a feed-forward neural net with one hidden layer. The environment contains a fixed amount of food, each piece represented as a red dot. When an agent moves it loses energy, and when it is near food it gains energy. An agent with 0 energy dies. The input of the neural net is the agent's current angle and a vector to the closest food. At every time step, the angle of movement of each agent is changed by the output of its neural net. The aim, of course, is to see food-seeking behaviour evolve after some time. However, nothing happens.
I don't know whether the problem is the structure of the neural net (too simple?) or the reproduction mechanism. To prevent a population explosion, the initial population is about 20 agents, and as the population approaches 50 the reproduction chance approaches zero. When reproduction does occur, the parent is chosen by going over the list of agents from beginning to end and checking, for each agent, whether a random number between 0 and 1 is less than the ratio of that agent's energy to the total energy of all agents. If so, the search stops and that agent becomes a parent: we add to the environment a copy of it, with some probability of mutation in one or more of the weights of its neural network.
Thanks in advance!
If the environment is benign enough (e.g. it's easy enough to find food), then just moving randomly may be a perfectly viable strategy, and reproductive success may be far more influenced by luck than by anything else. Also consider unintended consequences: e.g. if offspring are co-sited with their parent, then both are immediately in competition with each other in the local area, which might be sufficiently disadvantageous to lead to the death of both in the longer term.
To test your system, introduce an individual with a "premade" neural network set up to steer it directly towards the nearest food (your model is such that such a thing exists and is reasonably easy to write down, right? If not, it's unreasonable to expect it to evolve!). Introduce that individual into your simulation amongst the dumb masses. If it doesn't quickly dominate, it suggests your simulation isn't set up to reinforce such behaviour. But if the individual enjoys reproductive success and it and its descendants take over, then your simulation is doing something right and you need to look elsewhere for the reason such behaviour isn't evolving.
Update in response to comment:
It seems to me that this mixing of angles and vectors is dubious. Whether individuals can evolve towards the "move straight towards the nearest food" behaviour must rather depend on how well an atan function can be approximated by your network (I'm sceptical). Again, this suggests more testing:
- Set aside the ecological simulation entirely and just test whether perturbing a population of your style of random networks can evolve them towards the expected function.
- (Simpler, better) Have the network output a vector (instead of an angle): the direction the individual should move in (of course this means having 2 output nodes instead of one). The "move straight towards food" strategy is then just a straight pass-through of the "direction towards food" vector components, and the interesting thing is to see whether your random networks evolve towards this simple "identity function". This also allows the introduction of a ready-made optimised individual as described above; see the sketch below.
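A minimal sketch of such a ready-made "steer straight towards the nearest food" policy under the vector-output convention (positions are assumed to be 2-D NumPy arrays; this is the behaviour the evolved networks would ideally approximate):

```python
import numpy as np

def premade_policy(agent_pos, food_positions):
    """Return a unit vector pointing from the agent towards the nearest food item."""
    nearest = min(food_positions, key=lambda f: np.linalg.norm(f - agent_pos))
    direction = nearest - agent_pos
    norm = np.linalg.norm(direction)
    return direction / norm if norm > 0 else np.zeros(2)

# Example: agent at the origin, food at (3, 4) -> moves along (0.6, 0.8)
print(premade_policy(np.array([0.0, 0.0]), [np.array([3.0, 4.0]), np.array([-10.0, 0.0])]))
```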
I'm dubious about the "fixed amount of food" too (I assume you mean that as soon as a red dot is consumed, another one is introduced). A more "realistic" model might be to introduce food at a constant rate and not impose any artificial population limits: population limits are then determined by the limitations of the food supply. E.g. if you introduce 100 units of food per minute and individuals need 1 unit of food per minute to survive, then your simulation should tend towards a long-term average population of 100 individuals without any need for a clamp to avoid a "population explosion" (although boom-and-bust, feast-or-famine dynamics may still emerge, depending on the details).
This sounds like a problem for reinforcement learning; there is a good online textbook too.
