I just learned what controlled variables mean for a project I am doing, and I was trying to find out whether scikit-learn has a controlled-variable option. Specifically, does Python have controlled variables (not independent variables) for logistic regression?
I googled around and found nothing for Python. However, I was thinking more basically: that controlling for a variable means stratifying on the group you are interested in (say race) and then doing the analysis on each group based on your x's and y. If this is correct, then I am supposed to interpret the results from those stratified groups, right?
Sorry, I asked two questions, but I am trying to gain as much info as possible on this controlled-variable idea and its application in Python.
As you may know, control variables are variables the experimenter is not interested in studying, but believes have a significant influence on the value the dependent variable takes. So people generally hold the value of such a variable constant when they run their experiments, i.e. while collecting data.
To give an example, assume you are trying to model the health condition of a person, i.e. classify whether he or she is healthy, and you are considering age, gender and exercise pattern as inputs to your model and want to study how each input affects the target variable. But you know very well that the country in which the subject resides will also have a say in his or her health condition (it encodes climate, health facilities, etc.). So, to make sure this variable (country) does not affect your model, you collect all your data from just one country.
So, answering your first question: no, Python (scikit-learn) does not have a controlled-variable option. It simply assumes that all the input variables you feed in are of interest to the experimenter.
Coming to your second question: one way of handling a control variable is by first grouping the data with respect to it, so that each group has a constant value for that control variable; then we run logistic regression (or any model) for each group separately and 'pool' the results from the different models. But this approach falls apart if the number of levels of your control variable is really high, in which case we generally treat the control variable as an independent variable and feed it to the model.
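For illustration, here is a minimal sketch of that grouping approach in Python (the DataFrame, the column names and the control variable "country" are all made-up placeholders, not part of the original question):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: the column names and values are purely illustrative.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(20, 70, 600),
    "gender": rng.integers(0, 2, 600),            # already numerically encoded
    "exercise_hours": rng.uniform(0, 10, 600),
    "country": rng.choice(["A", "B", "C"], 600),  # the control variable
    "healthy": rng.integers(0, 2, 600),           # binary target
})

# Group by the control variable so it is constant within each fit,
# then run one logistic regression per group and collect the coefficients.
coefs = {}
for country, group in df.groupby("country"):
    X = group[["age", "gender", "exercise_hours"]]
    y = group["healthy"]
    model = LogisticRegression(max_iter=1000).fit(X, y)
    coefs[country] = dict(zip(X.columns, model.coef_[0]))

# "Pooling" here simply means comparing the per-group coefficients side by side.
print(pd.DataFrame(coefs).T)
```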
For more details, please refer to 1 or 2; they have some really nice explanations.
Our team developed a granular, individual-customer-level ML model which predicts the revenue generated (dependent variable) for a given amount of promotions (calls, emails, meetings, impressions, etc.) and a few categorical details of the customer (address, designation, segment, etc.).
Once the model was in place, we were supposed to find the "optimal" promotions at which revenue is maximized at the individual level, under certain constraints (e.g. overall promotion spend under a certain amount; we have a cost for each promotion, so essentially the constraints are bounds on the independent feature values).
At present we simulate all possible promotional values for calls, emails, meetings, impressions, etc. (the categorical variables do not change for an individual customer) and pick the best predicted revenue case for each individual.
The problem with this approach is that it is brute force. We have ~1 million customers, and simulating the various quantities of promotional entries (lots of combinations) produces huge amounts of data, and the prediction itself takes significant time.
Since this is a simple maximization problem, but one whose objective function is an ML model's prediction (a neural network in our case) over the input features, we are looking for any optimizers that can solve it without the need for data simulation. It is like the Solver function in Excel, where the objective function is an ML model prediction.
A simple analogy: California Housing Prices is a well-known ML regression problem. Assuming we have a well-developed model (a non-parametric model, say a neural network), how can we estimate the best feature values (with some constraints) at which the house price is maximized for each county (imagine the data also contains county information) without explicitly simulating the data?
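For what it's worth, one common way to approximate such a "solver over a model" is to hand the (negated) model prediction to a generic optimizer as the objective and encode the spend cap as a penalty or explicit constraint. Below is a rough sketch with SciPy's differential_evolution; the model, costs, budget and bounds are all made-up stand-ins, and in practice you would plug in your trained network:

```python
import numpy as np
from scipy.optimize import differential_evolution
from sklearn.neural_network import MLPRegressor

# Stand-in for the real revenue model: a small net fit on random data.
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 20, size=(500, 4))            # calls, emails, meetings, impressions
y_train = X_train @ [3.0, 0.5, 8.0, 0.1] + rng.normal(0, 1, 500)
model = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0).fit(X_train, y_train)

costs = np.array([50.0, 1.0, 200.0, 5.0])              # hypothetical cost per promotion unit
budget = 2_000.0                                       # hypothetical spend cap per customer

def negative_revenue(x):
    # Penalize budget violations so the optimizer stays inside the spend constraint.
    penalty = max(0.0, costs @ x - budget) * 1e3
    return -model.predict(x.reshape(1, -1))[0] + penalty

bounds = [(0, 20), (0, 200), (0, 10), (0, 400)]        # allowed range per promotion type
result = differential_evolution(negative_revenue, bounds, seed=0)
print("best promotion mix:", result.x, "predicted revenue:", -result.fun)
```

This is still iterative (the optimizer calls the model many times per customer), but it avoids enumerating every combination up front.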
I would encourage you to look into solutions like LIME or SHAP:
https://github.com/slundberg/shap - Model explanations via Game Theory.
https://github.com/marcotcr/lime - Model explanations via locally interpretable fitting.
I need to validate the correctness of a heating/cooling cycle based on the readings of a temperature sensor over time.
A correct time series has a certain shape (number of ups and downs), lasts more or less the same amount of time, and reaches a certain maximum temperature that needs to be met during the cycle.
Typically the process is faulty when it is compressed or stretched in time, has temperatures that are too low at the peaks, or in general the heating/cooling envelope is messed up. In the picture above I posted a simplified example of proper and faulty cycles of the process.
What classifier would you recommend for a supervised learning model? Is it at all possible to develop an unsupervised model for such a scenario?
I am currently calculating the maximum temperature and the cross-correlation of one master (typical proper) cycle against the tested one, but I wonder if there is a better, more generic way to tackle the problem.
IMHO machine learning is over-engineering this problem; some banding and counting of peaks seems like the much easier approach to me.
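For instance, a rule-based check along those lines could be as small as this (all thresholds are made-up placeholders that would need tuning to the real process):

```python
import numpy as np
from scipy.signal import find_peaks

def cycle_looks_ok(temps, expected_peaks=3, min_peak_temp=80.0, min_max_temp=95.0):
    """Rudimentary rule-based check of one heating/cooling cycle.

    All thresholds are placeholders; tune them to the real process.
    """
    peaks, _ = find_peaks(temps, height=min_peak_temp)
    return len(peaks) == expected_peaks and temps.max() >= min_max_temp

# Toy cycle: three heat-ups followed by cool-downs.
t = np.linspace(0, 3 * np.pi, 300)
cycle = 60 + 40 * np.abs(np.sin(t))
print(cycle_looks_ok(cycle))
```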
Nonetheless, if you want machine learning I would go with autoencoders for anomaly detection; examples can be found here or here.
TL;DR:
The idea is that an autoencoder reconstructs the input through a very small bottleneck (e.g. one value, which could be the phase), so any current point will reconstruct a good-looking curve. That reconstruction then gets compared to the actual curve. If it fits, all is good; if it doesn't, you know something is not right.
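A minimal sketch of that idea with Keras; the layer sizes, the threshold and the synthetic cycles are assumptions for illustration only:

```python
import numpy as np
from tensorflow import keras

# Synthetic stand-in data: each row is one cycle sampled at 100 points.
rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 100)
normal_cycles = np.array([np.sin(t + rng.normal(0, 0.1)) + rng.normal(0, 0.05, t.size)
                          for _ in range(500)])

# Autoencoder with a very small bottleneck, as described above.
autoencoder = keras.Sequential([
    keras.layers.Input(shape=(100,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(2),                      # bottleneck
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(100),
])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(normal_cycles, normal_cycles, epochs=20, batch_size=32, verbose=0)

def reconstruction_error(cycle):
    recon = autoencoder.predict(cycle.reshape(1, -1), verbose=0)[0]
    return float(np.mean((cycle - recon) ** 2))

# A cycle compressed in time should reconstruct poorly compared to normal ones.
faulty = np.sin(2 * t)
threshold = 0.05                                # placeholder; calibrate on held-out good cycles
print("faulty?", reconstruction_error(faulty) > threshold)
```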
Currently I am researching viable approaches to identifying a certain object with image processing techniques; however, I am struggling to find them. For example, I have a CNN capable of detecting certain objects, like a person, and I can track that person as well. My issue is that I want to identify the detected and tracked person, i.e. save its details and give it an ID. I do not want to know who he/she is; I just want to assign an ID in that manner.
Any help/resource will be appreciated.
Create a database and store the details you need for later use (e.g. the object type and some usable specifications), giving each object a unique ID. The CNN has already recognized the object, so you just need to store it in the database; later on you can perform more processing on the generated data. That is a simple solution to the problem you are describing.
Okay, I see your problem: you want to identify what kind of object is being tracked, because the CNN is only tracking, not identifying. For that purpose you have to train your CNN on some specific features and give them an identity, e.g. objectA has features [x, y, z]. Then the CNN will help you find the identity of the object.
You can use OpenCV to do this as well: store some features of specific objects, then use a distance-matching technique to match the live features against the stored features.
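As a rough sketch of that OpenCV route (the image paths and the match threshold are placeholders):

```python
import cv2

# Placeholder image paths: a stored crop of the object and the live detection.
stored = cv2.imread("stored_object.png", cv2.IMREAD_GRAYSCALE)
live = cv2.imread("live_detection.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create()
_, stored_desc = orb.detectAndCompute(stored, None)
_, live_desc = orb.detectAndCompute(live, None)

# Hamming distance is the usual choice for ORB's binary descriptors.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = matcher.match(stored_desc, live_desc)
good = [m for m in matches if m.distance < 40]   # threshold is arbitrary

print(f"{len(good)} good matches -> same object?", len(good) > 20)
```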
Thanks.
I think you are looking for something called ReID. There are a lot of papers about it in CVPR2018.
You can imagine that you would need some sort of stored characteristic vector for each person. For each detected person, assign a new ID if it does not match any previous record, or return the existing ID if it does match a record. The key is how to compute this characteristic vector. CNN features (from an intermediate layer) can be one option; Gaussian mixtures of the colours of the detected human patch can be another.
It is still a very active research field, and I think it would be quite hard to build an accurate system if you don't have many resources or much time at hand.
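To make the ID-assignment logic concrete, here is a toy sketch where random vectors stand in for the characteristic (e.g. CNN) features and the similarity threshold is arbitrary:

```python
import numpy as np

gallery = {}          # id -> stored characteristic vector
next_id = 0

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def assign_id(feature, threshold=0.8):
    """Return an existing ID if the feature matches a stored record, else a new one."""
    global next_id
    if gallery:
        best_id, best_sim = max(
            ((pid, cosine_similarity(feature, vec)) for pid, vec in gallery.items()),
            key=lambda item: item[1],
        )
        if best_sim >= threshold:
            return best_id
    gallery[next_id] = feature
    next_id += 1
    return next_id - 1

# Random vectors standing in for CNN embeddings of detected people.
rng = np.random.default_rng(0)
person_a = rng.normal(size=128)
print(assign_id(person_a))                              # 0 (new record)
print(assign_id(person_a + rng.normal(0, 0.01, 128)))   # 0 again (matches)
print(assign_id(rng.normal(size=128)))                  # 1 (new person)
```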
I will state up front, I am not a Data Scientist, but have the wherewithal to learn what I need to know. However, I need advice as to where to look and the appropriate algorithms to study.
The problem is as follows. I have 10 years of 1-hour observations of output from a sensor. For the sake of argument, let's use the output of a weather station and, more specifically, a solar panel, in the form of a float in millivolts.
You might argue that if a 24-hour subset of data from this time series (24 points) were taken as a matrix for comparison against the historical time series, one could identify "sunny" days in the past. If we were to take the latest 24 hours of data as the comparison, we might be able to identify days that were "similar" to today and, by taking the next subsequent matrix from a matched position, "predict" what is going to happen tomorrow based on historical behaviour.
This is of course a rough analogy, but illustrates my problem.
I wish to take an arbitrary 24-hour period from the time series (let's call this matrix A) and identify, from the time series (thousands of matrices), those 24-hour periods that are similar.
I have reviewed a lot around this subject, in the form of various types of regression, and at one stage thought that data compression algorithms would be the most effective, if you could access the dictionary built during the process; however, I realised the matching in that case is "exact", whereas I want "similar".
I have settled on what I believe to be the right approach, "L1 Penalty and Sparsity in Logistic Regression", located here.
There I (if I understand correctly) take a comparison matrix, compare it to others and get a score for "similarity" (in this case called C). From here I can carry on with my experiment.
If some kind-hearted data scientist might do me a favour and 1. confirm my direction is effective or, if not, 2. point me to where I might find the process to answer my problem, I would be eternally grateful.
Many thanks in advance
ApteryxNZ
For time-series forecasting (prediction), you can look into LSTM neural networks, SVMs and even MLPs. I've seen time-series forecasting done with simpler classifiers, such as AODE.
To filter the data (if applicable) that you feed into your time-series model, you can look into Granger causality, particle swarm optimization and even genetic algorithms.
For finding similar patterns in the time series, I think your best option is Dynamic Time Warping (DTW), which is also used for speech recognition.
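To make DTW concrete, here is a tiny, unoptimized implementation in plain NumPy (libraries such as dtaidistance or fastdtw do the same thing far more efficiently):

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a)*len(b)) dynamic-time-warping distance between two 1-D series."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

# Two "days" of 24 hourly readings: same shape, one shifted by an hour.
day_a = np.sin(np.linspace(0, 2 * np.pi, 24))
day_b = np.roll(day_a, 1)
print(dtw_distance(day_a, day_b))   # stays small despite the shift
```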
You can search about related work in some journals such as:
Pattern Recognition Letters
Pattern Recognition
Neurocomputing
Applied Soft Computing
Information Sciences
Machine Learning
Neural Networks
IEEE Transaction on Neural Networks and Learning Systems
Note that this really depends on how you define "similar".
One simple way would be the "nearest neighbors" approach: treat your data points as 24-dimensional vectors, then find the ones with the shortest Euclidean (or Manhattan or…) distance to your goal point. Those are the most similar days. (k-d trees can speed up this process significantly.)
But 24 dimensions might be too many for your purposes. Principal Component Analysis (PCA) can reduce the 24-dimensional points to some lower number of dimensions while preserving as much of the variation as possible. Finding the closest points will then be much faster.
Note that both of these methods will only work if you're comparing value-by-value, that is, if you don't consider "the same but one hour later" to be particularly similar.
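A sketch of the nearest-neighbours idea with scikit-learn, with an optional PCA step in front; the random data is just a stand-in for the historical 24-hour windows:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

# Stand-in history: 3650 days, each a 24-dimensional vector of hourly readings.
rng = np.random.default_rng(0)
history = rng.normal(size=(3650, 24))
today = rng.normal(size=(1, 24))

# Optional: compress 24 dimensions down to a handful before the search.
pca = PCA(n_components=5).fit(history)
history_r, today_r = pca.transform(history), pca.transform(today)

nn = NearestNeighbors(n_neighbors=10, metric="euclidean").fit(history_r)
distances, indices = nn.kneighbors(today_r)
print("indices of the 10 most similar days:", indices[0])
```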
I have a set of data in a .tsv file available here. I have written several classifiers to decide whether a given website is ephemeral or evergreen.
My initial practice was rapid prototyping: I did a random classifier and a 1R classifier, tried some feature engineering, linear regression, logistic regression, naive Bayes, etc.
However, I did all of this in a jumbled-up, incoherent manner.
Is what I have done correct in this age of high-level programming where I can run 5/6 algorithms on my data in a night? Is a rapid prototyping approach the best idea here or is there a more reasoned, logical approach that can be taken?
At the moment, I have cleaned up the data, removing all the meaningless rows (there are only a few of these, so they can simply be discarded). I have written a script to cross-validate my classifier, so I have a metric to test for bias/variance and also to check overall algorithm performance.
Where do I go from here? What aspects do I need to consider? What do I think about here?
You could throw in some elements of theory. For example:
the naive Bayes classifier assumes that all variables are independent. Maybe that's not the case?
But this classifier is fast and easy, so it's still a good choice for many problems, even if the variables are not really independent.
the linear regression gives too much weight to samples that are far away from the classification boundary. That's usually a bad idea.
the logistic regression is an attempt to fix this problem, but it still assumes a decision boundary that is linear in the input variables. In other words, the boundary between the classes is a hyperplane in the input-variable space.
When I study a dataset, I typically start by drawing the distribution of each variable for each class of samples to find the most discriminating variables.
Then, for each class of samples, I usually plot a given input variable versus another to study the correlations between the variables: are there non-linear correlations? If yes, I might choose classifiers that can handle such correlations.
Are there strong correlations between two input variables?
If yes, one of the variables could be dropped to reduce the dimensionality of the problem.
These plots will also allow you to spot problems in your dataset.
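As an illustration, those exploratory plots take only a few lines with pandas/matplotlib; the DataFrame and feature names below are made up for the example:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy stand-in for the cleaned dataset: two features and a binary label.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "word_count": np.concatenate([rng.normal(300, 50, 200), rng.normal(600, 80, 200)]),
    "link_ratio": np.concatenate([rng.normal(0.2, 0.05, 200), rng.normal(0.4, 0.1, 200)]),
    "label": [0] * 200 + [1] * 200,
})
features = [c for c in df.columns if c != "label"]

# One histogram per feature, overlaid per class, to spot discriminating variables.
for feature in features:
    for cls, group in df.groupby("label"):
        group[feature].plot(kind="hist", alpha=0.5, bins=30, label=f"class {cls}")
    plt.title(feature)
    plt.legend()
    plt.show()

# Pairwise scatter coloured by class, to look for (non-)linear correlations.
df.plot(kind="scatter", x="word_count", y="link_ratio", c="label", colormap="coolwarm")
plt.show()
```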
In the end, trying many classifiers and optimizing their parameters for the best cross-validation results, as you have done, is a pragmatic and valid approach; it has to be done at some point anyway.
I understand from the tags on this post that you have used the classifiers of scikit-learn.
In case you have not noticed yet, this package also provides powerful tools for cross-validation: http://scikit-learn.org/stable/modules/cross_validation.html
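For example, with the current API (sklearn.model_selection), cross-validating a classifier takes only a couple of lines; here a bundled toy dataset stands in for the question's .tsv data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Bundled toy dataset as a stand-in for the question's .tsv data.
X, y = load_breast_cancer(return_X_y=True)

clf = LogisticRegression(max_iter=5000)
scores = cross_val_score(clf, X, y, cv=5)   # 5-fold cross-validation
print(scores.mean(), scores.std())
```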