I am very sorry if this question violates SO's question guidelines, but I am stuck and I cannot find anywhere else to ask this type of question. Suppose I have a dataset containing three experimental data series that were obtained under three different conditions (hot, cold, comfortable). The measurements are arranged as three columns of a pandas dataframe that has 4 columns in total (time, cold, comfortable and hot).
When I plot the data, I can visually see the separation of the three experiments, but I would like to do it automatically with machine learning.
The x-axis represents the time and the y-axis represents the magnitude of the data. I have read about different machine learning classification techniques, but I do not understand how to set up my data so that I can 'feed' it into the classification algorithm. Namely, my questions are:
Is this programmatically feasible?
How can I set up (arrange) my data so that it can be easily fed into a classification algorithm? From what I have read so far, it seems that for the algorithm to work the data has to be in a certain format (see for example the iris dataset, where the data is nicely labeled). How can I adapt the algorithms to fit my needs?
NOTE: Ideally, I would like a program that, given a magnitude value, classifies it as hot, comfortable or cold. The time series itself is not of much relevance in my case.
Of course this is feasible.
It's not entirely clear from the original post exactly what variables/features you have available for your model, but here is a bit of general guidance. All of these machine learning problems, from classification to regression, rely on the same core assumption that you are trying to predict some outcome based on a bunch of inputs. Usually this relationship is modeled like this: y ~ X1 + X2 + X3 ..., where y is your outcome ("dependent") variable, and X1, X2, etc. are features ("explanatory" variables). More simply, we can say that using our entire feature-set matrix X (i.e. the matrix containing all of our x-variables), we can predict some outcome variable y using a variety of ML techniques.
So in your case, you'd try to predict whether it's Cold, Comfortable, or Hot based on time. This is really more of a forecasting problem than it is a ML problem, since you have a time component that looks to be one of the most important (if not the only) features in your dataset. You may want to look at some simpler time-series forecasting methods (e.g. ARIMA) instead of ML algorithms, as some of the time-series ML approaches may not be well-suited for a beginner.
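If you do want the classification route from the note (label a given magnitude as cold, comfortable or hot), a minimal sketch, assuming the column names from the question and random placeholder data, could reshape the wide dataframe into (magnitude, label) pairs and fit a classifier on the magnitude alone:

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Placeholder data shaped like the question's dataframe: time plus one
    # column of magnitudes per condition.
    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "time": np.arange(100),
        "cold": rng.normal(5, 1, 100),
        "comfortable": rng.normal(20, 1, 100),
        "hot": rng.normal(35, 1, 100),
    })

    # Wide -> long: one row per (magnitude, condition) pair, which is the
    # labeled layout classifiers expect (compare the iris dataset).
    long_df = df.melt(id_vars="time", var_name="condition", value_name="magnitude")

    X = long_df[["magnitude"]]   # a single feature, as in the note
    y = long_df["condition"]     # labels: cold / comfortable / hot

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    clf = DecisionTreeClassifier().fit(X_train, y_train)
    print("test accuracy:", clf.score(X_test, y_test))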
In any case, this should get you started, I think.
Related
I received feedback on my paper about stock market forecasting with machine learning, and the reviewer asked the following:
I would like you to statistically test the out-of-sample performance of your methods. Hence 'differ significantly' in the original wording. I agree that some of the figures look awesome visually, but visually, random noise seems to contain patterns. I believe Sortino Ratio is the appropriate statistic to test, and it can be tested by using bootstrap. I.e., a distribution is obtained for both BH and your strategy, and the overlap of these distributions is calculated.
My problem is that I have never done that for time series data. My validation procedure uses a strategy called walk-forward, where I shift the data in time 11 times, generating 11 different combinations of training and test sets with no overlap. So, here are my questions:
1- What would be the best (or most appropriate) statistical test to use, given what the reviewer is asking?
2- If I remember correctly, statistical tests require vectors as input, is that correct? Can I generate a vector containing 11 Sortino ratio values (one for each walk) and then compare them with the baselines, or should I run my code more than once? I am afraid the latter would be unfeasible given the short time to review.
So, what would be the correct actions to compare machine learning approaches statistically in this time series scenario?
By pointing out that random noise seems to contain patterns, the reviewer means that your plots show nice patterns, but those patterns might just be random noise following some distribution (e.g. uniform random noise), which would make the results less trustworthy. It might be a good idea to split the data into k groups at random, then apply a Z-test or t-test and compare the k groups pairwise.
The reviewer points to the Sortino ratio, which seems ambiguous given that your goal is a machine learning model for a forecasting task; what you actually care about is the forecasting accuracy and reliability, which can be supported by using cross-validation (in convex optimization, the equivalent would be a sensitivity analysis).
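If you do end up reporting the Sortino ratio comparison the reviewer asks for, a rough sketch of the bootstrap idea (on placeholder return series; the sortino_ratio helper, the sample sizes and the number of resamples are illustrative choices, not part of the original answer) might look like this:

    import numpy as np

    rng = np.random.default_rng(0)

    def sortino_ratio(returns, target=0.0):
        """Mean excess return divided by the downside deviation."""
        excess = returns - target
        downside = excess[excess < 0]
        downside_dev = np.sqrt(np.mean(downside ** 2)) if downside.size else np.nan
        return np.mean(excess) / downside_dev

    def bootstrap_sortino(returns, n_boot=2000):
        """Resample the returns with replacement and recompute the ratio."""
        n = len(returns)
        return np.array([
            sortino_ratio(returns[rng.integers(0, n, n)]) for _ in range(n_boot)
        ])

    strategy_returns = rng.normal(0.001, 0.01, 500)   # placeholder data
    bh_returns = rng.normal(0.0005, 0.01, 500)        # placeholder data

    dist_strategy = bootstrap_sortino(strategy_returns)
    dist_bh = bootstrap_sortino(bh_returns)

    # One simple summary of the overlap: the fraction of resamples in which
    # the strategy's Sortino ratio does not beat buy-and-hold's.
    p_value = np.mean(dist_strategy - dist_bh <= 0)
    print(f"bootstrap p-value: {p_value:.4f}")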
Update
The problem of serial dependency in time series data arises when the data are non-stationary, which does not seem to be the case for your data. Even if it were, it could be addressed by removing trends, i.e. converting the non-stationary series into a stationary one (checking stationarity with an ADF test, for example), and you might also consider using ARIMA models.
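For reference, a minimal stationarity check along those lines (using statsmodels' ADF test on a placeholder random walk; the 0.05 threshold and single differencing are just illustrative defaults):

    import numpy as np
    import pandas as pd
    from statsmodels.tsa.stattools import adfuller

    prices = pd.Series(np.cumsum(np.random.randn(500)))  # placeholder random walk

    stat, p_value, *_ = adfuller(prices.dropna())
    print(f"ADF statistic: {stat:.3f}, p-value: {p_value:.3f}")

    if p_value > 0.05:
        # Failed to reject the unit-root null: difference once to detrend.
        stationary = prices.diff().dropna()
    else:
        stationary = prices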
Time shifting can sometimes be useful, but it is not considered a good measure of noise; it might, however, help improve model accuracy by shifting the data and extracting features (e.g. mean and variance over a window, etc.).
There's nothing preventing you from trying the time-shifting approach, but you can't rely on it as an accurate measurement; you still need to back your statistical analysis with more robust techniques.
I have a neural network program that is designed to take in input and output variables and use forecasted data to predict what the output variables should be. After running this program, I end up with an output vector. Let's say, for example, my input matrix has 100 rows and 10 columns and my output is a vector with 100 values. How do I determine which of my 10 variables (columns) had the most impact on the output?
I've computed the correlation between each of my variables (columns) and the output and ranked the variables by that correlation, but I'm wondering if there is a better way to go about this.
What you are describing is model selection, and it's not as simple as studying the correlation of your features with your target. For an in-depth, well-explained look at model selection, I'd recommend you read chapter 7 of The Elements of Statistical Learning. If what you're looking for is how to explain your network, then you're in for a treat as well, and I'd recommend reading this article for starters, though I won't go into the matter myself.
Naive approaches to model selection:
There are a number of ways to do this.
The naïve way is to estimate all possible models, i.e. every combination of features, which becomes computationally expensive as the number of features grows.
Another way is to take a variable you think is a good predictor and train the model on only that variable. Compute the error on the training data. Then take another variable at random, retrain the model and recompute the error. If the error drops, keep the variable; otherwise discard it. Keep going for all features (a small sketch of this greedy procedure is shown below).
A third approach is the opposite. Start by training the model on all features and sequentially drop variables (a less naïve approach would be to first drop variables you intuitively think have little explanatory power), computing the error on the training data each time and comparing it to decide whether to keep the feature.
There are a million ways of going about this. I've presented three of the simplest, but again, you can go really deep into this subject and find all kinds of different information (which is why I highly recommend you read that chapter :) ).
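To make the greedy forward approach concrete, here is a minimal sketch on synthetic data; the linear model and the mean-squared training error are placeholders for whatever model and error measure you actually use:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 10))                   # placeholder inputs
    y = X[:, 0] * 2 + X[:, 3] - X[:, 7] + rng.normal(scale=0.1, size=100)

    selected, remaining = [], list(range(X.shape[1]))
    best_error = np.inf

    while remaining:
        # Try adding each remaining feature and keep the one that helps most.
        errors = {}
        for j in remaining:
            cols = selected + [j]
            model = LinearRegression().fit(X[:, cols], y)
            errors[j] = mean_squared_error(y, model.predict(X[:, cols]))
        j_best = min(errors, key=errors.get)
        if errors[j_best] < best_error:
            best_error = errors[j_best]
            selected.append(j_best)
            remaining.remove(j_best)
        else:
            break  # no remaining feature reduces the training error

    print("selected feature indices:", selected)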
I have a list of time series of values measured in different places. These measurements may or may not be correlated (mostly depending on their relative positions, but it is plausible that some very close detectors would actually measure decorrelated series). I would like to predict the values of the whole set, taking into account all of the series and their correlation through time. If it is of any help, the values should also be roughly periodic.
EDIT: I have access to the power generated by several solar panels. These solar panels are spread spatially, and I would like to use them as 'irradiance detectors'. Knowing the sun illumination in several places in the past, I wish to identify correlations between signals, which could then be used to make predictions of illumination.
Regardless of the usual patterns of production through a day (as seen in the image), what I am interested in is the information I can extract from one panel's past to predict another one's future.
I think I would need a neural network to solve this problem, but I am not sure how to feed it: I thought of using a temporal window and feeding my NN a few past values from A, B and C, but I am afraid that is a little weak.
The image shows an example of what my data looks like.
How can I predict the next values of curve A knowing past values of A, B and C?
How to handle this prediction?
I think the easiest way is to train 3 models with the same input, each predicting one value (A, B or C).
If you are confident there is correlation between the input variables and their impact on the predicted outputs, you may create one neural network with a common branch (probably an RNN over the 3 stacked inputs) and then 3 different prediction heads, each producing one prediction (A, B or C). The Fast R-CNN architecture is a great example of this; a rough sketch is shown below.
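A rough sketch of that shared-trunk, multi-head idea (in Keras, on random placeholder windows; the window length, layer sizes and number of epochs are arbitrary choices, not part of the original answer):

    import numpy as np
    from tensorflow.keras import Input, Model
    from tensorflow.keras.layers import LSTM, Dense

    window = 24          # past time steps fed to the network
    n_series = 3         # A, B and C stacked as channels

    inputs = Input(shape=(window, n_series))
    shared = LSTM(32)(inputs)                  # common branch over the stacked inputs
    head_a = Dense(1, name="A")(shared)        # one prediction head per series
    head_b = Dense(1, name="B")(shared)
    head_c = Dense(1, name="C")(shared)

    model = Model(inputs, [head_a, head_b, head_c])
    model.compile(optimizer="adam", loss="mse")

    # Placeholder data: 100 windows of 24 steps for the 3 series.
    X = np.random.rand(100, window, n_series)
    y = [np.random.rand(100, 1) for _ in range(n_series)]
    model.fit(X, y, epochs=2, verbose=0)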
The best way to achieve this task is to use an RNN.
A good tutorial for learning how to develop such a neural network is here:
https://www.tensorflow.org/tutorials/recurrent
I also found this link, where they trained an RNN for a rather similar problem:
http://blog.datatonic.com/2016/11/traffic-in-london-episode-ii-predicting.html
An even better source of inspiration:
http://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/
Disclaimer: I'm new to the field of machine learning, and even though I have done my fair share of research during the past month, I still lack a deep understanding of the topic.
I have been playing around with the scikit-learn library with the objective of learning how to predict new data based on historical information, and how to classify existing information.
I'm trying to solve 2 different problems which may be related:
Problem 1
Given a data set containing rows R1 ... RN with features F1 ... FN, and a target for each group of rows, determine which group row R(N+1) belongs to.
Now, the target value is not a single value, it's a set of values. The best solution I have been able to come up with is to represent those sets of values as a concatenation; this creates an artificial class and allows me to represent multiple values using only one attribute. Is there a better approach to this?
What I'm expecting is to be able to pass a totally new set of rows and be told the target values for each of them.
Problem 2
Given a data set containing rows R1 ... RN with features F1 ... FN, predict the values of R(N+1) based on the frequency of the features.
A few considerations here:
Most of the features are categorical in nature.
Some of the features are dates, so when doing the prediction the date should be in the future relative to the historic data.
The frequency analysis needs to be done per row, because certain sets of values may be invalid.
My question here is: is there any process/ML algorithm which, given historical data, would be able to predict a new set of values based only on the frequency of the parameters?
If you have any suggestions, please let me know.
Regarding Problem 1, if you expect the different components of the target value to be independent, you can approach the problem as building a classifier for every component. That is, if the features are F = (F_1, F_2, ..., F_N) and the targets Y = (Y_1, Y_2, ..., Y_N), create a classifier with features F and target Y_1, a second classifier with features F and target Y_2, etc.
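For illustration, a minimal sketch of that per-component idea on random placeholder data; scikit-learn's MultiOutputClassifier simply fits one classifier per target column under the hood, so it amounts to the same thing:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.multioutput import MultiOutputClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))            # features F_1 .. F_5 (placeholder)
    Y = rng.integers(0, 3, size=(200, 2))    # target components Y_1, Y_2 (placeholder)

    clf = MultiOutputClassifier(RandomForestClassifier(n_estimators=50))
    clf.fit(X, Y)

    new_rows = rng.normal(size=(3, 5))
    print(clf.predict(new_rows))             # one (Y_1, Y_2) pair per new row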
Regarding Problem 2, if you are not dealing with a time series, IMO the best you can do is simply predict the most frequent value for each feature.
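In pandas terms, that baseline is just the per-column mode of the historical data (a tiny sketch with made-up categorical columns):

    import pandas as pd

    history = pd.DataFrame({
        "color": ["red", "blue", "red", "red"],
        "size": ["S", "S", "M", "L"],
    })

    # .mode() returns the most frequent value(s) per column; take the first row.
    prediction = history.mode().iloc[0]
    print(prediction.to_dict())   # {'color': 'red', 'size': 'S'}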
That said, I believe your question is a better fit for another Stack Exchange site, such as Cross Validated.
Edited post
This is a short and somewhat clarified version of the original post.
We've got a training dataset (some features are significantly correlated). The feature space has 20 dimensions (all continuous).
We need to train a nonparametric imputer (kNN or tree-based regression) on the training data; most features form nonlinear subspaces and we can't assume a distribution for any of them.
We need to predict multiple missing values in query data (a query feature-vector can have up to 13 missing features, so the imputer should handle any combination of missing features) using the trained imputer. NOTE: the imputer should not be retrained/fitted in any way on the query data (as is done in all mainstream R packages I've found so far: Amelia, impute, mi and mice...). That is, the imputation should be based solely on the training data.
The purpose for all this is described below.
A small data sample is down below.
Original post (TL;DR)
Simply put, I've got some sophisticated data imputing to do. We've got a training dataset of ~100k 20D samples and a smaller testing dataset. Each feature/dimension is a continuous variable, but the scales are different. There are two distinct classes. I use sklearn.ensemble.ExtraTreesClassifier for classification and, although tree ensembles can handle missing data cases, there are three reasons to perform imputation:
This way we get votes from all trees in a forest during classification of a query dataset (not just those that don't have a missing feature/features).
We don't lose data during training.
The scikit-learn implementation of tree ensembles (both ExtraTrees and RandomForest) does not handle missing values. But this point is not that important; if it weren't for the former two reasons, I would have just used rpy2 + some nice R implementation.
Things are quite simple with the training dataset because I can apply a class-specific median imputation strategy to deal with missing values, and this approach has been working fine so far. Obviously this approach can't be applied to a query - we don't have the classes to begin with. Since we know that the classes will likely have significantly different shares in the query, we can't apply a class-indifferent approach either, because that might introduce bias and reduce classification performance; therefore we need to impute missing values from a model.
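For context, the class-specific median strategy on the training data is essentially a grouped fillna (a quick sketch with a toy frame shaped like the sample further below):

    import pandas as pd

    train = pd.DataFrame({
        "v1": [0.4, None, 0.3, 0.9],
        "v2": [0.7, 0.2, None, 0.8],
        "category": [1, 1, 2, 2],
    })

    # Fill each NA with the median of its own class for that feature.
    imputed = train.groupby("category").transform(lambda col: col.fillna(col.median()))
    imputed["category"] = train["category"]
    print(imputed)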
Linear models are not an option for several reasons:
all features are correlated to some extent;
theoretically we can get any combination of missing features in a sample feature-vector; even though our tool requires at least 7 non-missing features, we end up with ~1E6 possible models, which doesn't look very elegant if you ask me.
Tree-based regression models aren't good for the very same reason. So we ended up picking kNN (k nearest neighbours) - ball tree or LSH with a radius threshold, to be more specific. This approach fits the task quite well because the dimensions (and hence distances) are correlated, so we get decent performance even in extremely NA-rich cases, but there are several drawbacks:
I haven't found a single implementation in Python (including impute, sklearn.preprocessing.Imputer, orange) that handles feature-vectors with different sets of missing values; that is, we want a single imputer for all possible combinations of missing features.
kNN uses pairwise point distances for prediction/imputation. As I've already mentioned, our variables have different scales, so the feature space must be normalised prior to distance estimation, and we need to know the theoretical max/min values for each dimension to scale it properly. This is not so much a problem as a matter of architectural simplicity (the user will have to provide a vector of min/max values).
So here is what I would like to hear from you:
Are there any classic ways to address the kNN-related issues given in the list above? I believe this must be a common case, yet I haven't found anything specific on the web.
Is there a better way to impute data in our case? What would you recommend? Please, provide implementations in Python (R and C/C++ are considered as well).
Data
Here is a small sample of the training data set. I reduced the number of features to make it more readable. The query data has identical structure, except for the obvious absence of category information.
v1       v2       v3       v4       v5       category
0.40524  0.71542  NA       0.81033  0.8209   1
0.78421  0.76378  0.84324  0.58814  0.9348   2
0.30055  NA       0.84324  NA       0.60003  1
0.34754  0.25277  0.18861  0.28937  0.41394  1
NA       0.71542  0.10333  0.41448  0.07377  1
0.40019  0.02634  0.20924  NA       0.85404  2
0.56404  0.5481   0.51284  0.39956  0.95957  2
0.07758  0.40959  0.33802  0.27802  0.35396  1
0.91219  0.89865  0.84324  0.81033  0.99243  1
0.91219  NA       NA       0.81033  0.95988  2
0.5463   0.89865  0.84324  0.81033  NA       2
0.00963  0.06737  0.03719  0.08979  0.57746  2
0.59875  0.89865  0.84324  0.50834  0.98906  1
0.72092  NA       0.49118  0.58814  0.77973  2
0.06389  NA       0.22424  0.08979  0.7556   2
Based on the new update, I think I would recommend against kNN or tree-based algorithms here. Since imputation is the goal and not a side effect of the methods you're choosing, you need an algorithm that will learn to complete incomplete data.
To me this seems very well suited to a denoising autoencoder. If you're familiar with neural networks, it's the same basic principle: instead of training to predict labels, you train the model to reproduce the input data, with a notable twist.
The 'denoising' part refers to an intermediate step where you randomly set some percentage of the input data to 0 before attempting to reconstruct it. This forces the algorithm to learn richer features and how to complete the data when there are missing pieces. In your case I would recommend a low amount of dropout during training (since your data is already missing features) and no dropout at test time.
It would be difficult to write a helpful example without looking at your data first, but the basics of what an autoencoder does (as well as a complete code implementation) are covered here: http://deeplearning.net/tutorial/dA.html
This link uses a Python module called Theano, which I would HIGHLY recommend for the job. The flexibility of the module trumps every other module I've looked at for machine learning, and I've looked at a lot. It's not the easiest thing to learn, but if you're going to be doing a lot of this kind of stuff I'd say it's worth the effort. If you don't want to go through all that, you can still implement a denoising autoencoder in Python without it.
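For concreteness, here is a minimal sketch of the denoising idea on random placeholder data, written with Keras rather than Theano; the layer sizes, the 20% masking rate and the optimizer are arbitrary illustrative choices:

    import numpy as np
    from tensorflow.keras import Input, Model
    from tensorflow.keras.layers import Dense

    rng = np.random.default_rng(0)
    X = rng.random((1000, 20))               # placeholder 20-D training data

    # "Noise": randomly zero out ~20% of the entries, mimicking missing values.
    mask = rng.random(X.shape) < 0.2
    X_corrupted = np.where(mask, 0.0, X)

    inputs = Input(shape=(20,))
    encoded = Dense(10, activation="relu")(inputs)
    decoded = Dense(20, activation="sigmoid")(encoded)

    autoencoder = Model(inputs, decoded)
    autoencoder.compile(optimizer="adam", loss="mse")
    autoencoder.fit(X_corrupted, X, epochs=5, verbose=0)

    # At query time, feed a vector with its NAs set to 0 and read the
    # reconstructed values back for the missing positions.
    query = X[:1].copy()
    query[0, [2, 7]] = 0.0                   # pretend features 2 and 7 are missing
    print(autoencoder.predict(query, verbose=0)[0, [2, 7]])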