Classification of accelerometer data - python

I am trying to classify accelerometer data into 4 classes: 1, 2, 3, 4. The training dataset looks like the following:
The training labels are contained in another file and provide a label for only every 10th observation. This is what they look like:
Now I am not sure how to interpret this. Should I only use the training_labels dataset to train a model? In that case, I don't know why the first dataset is given. Also, using only the second set would mean a loss of information. I thought of doing a left outer join of the first dataset with the second and using 'bfill' in df.fillna() to get rid of the NaN values, and then using that data to train, but I am confused as to whether this is the right approach. I am still a beginner at machine learning, so any help is appreciated.
EDIT: The data comes from an online course I am doing. It says: "Because the accelerometers are sampled at high frequency, the labels in train_labels are only provided for every 10th observation."

If you can afford to discard 90% of your data, you can just use the observations with labels. You can also take the mean/median x, y, z coordinates of the 10 observations that share a provided label, or use that same label for the last 10 observations. Those approaches seem legitimate to me.
Probably the sampling frequency was unnecessarily high, so you can assume labels do not change that quickly. But this can also depend on the problem at hand.
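A minimal pandas sketch of those options (the file names, column names and index alignment are assumptions, not from the course):
import pandas as pd

df = pd.read_csv("train_time_series.csv", index_col=0)    # hypothetical file names
labels = pd.read_csv("train_labels.csv", index_col=0)

# Option 1: keep only the labelled rows (discards ~90% of the readings)
labelled_only = df.join(labels["label"], how="inner")

# Option 2: summarise each block of 10 readings (e.g. by its mean); how the
# blocks line up with the labelled rows is an assumption here
block_means = df.groupby(df.index // 10)[["x", "y", "z"]].mean()

# Option 3: keep every reading and back-fill each label onto the preceding
# unlabelled rows (essentially the left-join + bfill idea from the question)
full = df.join(labels["label"], how="left")
full["label"] = full["label"].bfill()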

Related

SHAP for a single data point, instead of average prediction of entire dataset

I am trying to explain a regression model based on LightGBM using SHAP. I'm using the
shap.TreeExplainer(<lightgbm model>).shap_values(X)
method to get the SHAP values, where X is the entire training dataset. These SHAP values compare an individual prediction against the average prediction of the entire dataset.
In the online book by Christoph Molnar, section 5.9.4, he mentions that:
"Instead of comparing a prediction to the average prediction of the entire dataset, you could compare it to a subset or even to a single data point."
I have a couple of questions regarding this:
Am I correct to interpret that if, instead of passing the entire training dataset, I pass a subset of, say, 20 observations, then the SHAP values returned will be relative to the average prediction of these 20 observations? This would be the equivalent of the "subset" that Christoph Molnar mentions in his book.
Assuming the answer to question 1 is yes, what if, instead of generating SHAP values relative to the average of 20 observations, I want to generate SHAP values relative to one specific observation? Christoph Molnar seems to imply that this is possible. If it is, how do I do it?
Thank you in advance for the guidance!
Yes, but the definition of "average" is important. If you supply a "background" dataset, your explanations will be calculated against this background, not against the whole dataset. As for "relative to the average" of the background, one needs to understand that SHAP values are average marginal contributions over all possible coalitions. So as far as SHAP values are concerned, you fix the coalition(s), and the rest is, yes, averaged. This allows fitting the model once and then passing different coalitions (with the rest averaged) through a model that was trained only once. This is where SHAP's time savings come from.
If you're interested in more, you may visit the original paper or this blog.
Yes. You can supply a single data row as the background; for binary classification, e.g., supply a row from the other class for the explanation and see which features, and by how much, changed the class output.
Yes. By the mathematical formulation in the original paper, SHAP values are "the contribution of a feature to the difference between the actual prediction and the average prediction". The average prediction, sometimes called the "base value" or "expected model output", is relative to the background dataset you provided.
Yes. You can use a background dataset of 1 sample. Common choices for the background dataset are the training data, a single reference sample, or even a dataset of all zeros. From the author: "I recommend using either a single background data point, a small random subset of the true background, or for the best performance a set of k-medians (weighted by how many training points they each represent) designed to represent the background succinctly."
Below are more details to support my answers to the two questions and to show how question 2 can be done. So, why does the "expected model output" depend on the background dataset? To answer this question, let's walk through how SHAP works:
Step 1: We create a SHAP explainer providing two things: a trained prediction model and a background dataset. From the background dataset, SHAP creates an artificial dataset of coalitions. Each coalition is a binary vector over the features: 1 represents a feature being present, and 0 absent. So there are 2^M possible coalitions for M features.
explainer = shap.KernelExplainer(f, background_X)
Step 2: We provide the sample(s) for which we want to compute SHAP values. SHAP fills in values for this artificial dataset such that present features take the original values of that sample, and absent features are filled with a value from the background dataset. Then the prediction is generated for this coalition. If the background dataset has n rows, the absent features are filled n times and the average of the n predictions is used as the prediction of this coalition. If the background dataset has a single sample, then the absent features are filled with the values of that sample.
shap_values = explainer.shap_values(test_X)
Therefore, the SHAP values are relative to the average prediction of the background dataset.
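For question 2, a minimal sketch of comparing against one specific observation (the variable names, and the use of the model's predict function with KernelExplainer, are assumptions for illustration):
import shap

# hypothetical names: `model` is the fitted LightGBM model, `X` the training DataFrame
reference_row = X.iloc[[0]]                       # background of exactly one sample
explainer = shap.KernelExplainer(model.predict, reference_row)

# SHAP values for another observation, relative to the prediction on reference_row
shap_values = explainer.shap_values(X.iloc[[1]])

# expected_value is simply the model's prediction on reference_row; the SHAP values
# explain the difference between the prediction on X.iloc[[1]] and that reference
print(explainer.expected_value, shap_values)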

Best way to scale across different datasets

I have come across a peculiar situation when preprocessing data.
Let's say I have a dataset A. I split the dataset into A_train and A_test. I fit a scaler (scikit-learn) on A_train and transform A_test with that scaler. Now training the neural network on A_train and validating on A_test works well: no overfitting, and performance is good.
Let's say I have dataset B with the same features as in A, but with different ranges of values for the features. A simple example of A and B could be the Boston and Paris housing datasets respectively (this is just an analogy to say that feature ranges like cost, crime rate, etc. vary significantly). To test the performance of the above-trained model on B, we transform B according to the scaling attributes of A_train and then validate. This usually degrades performance, as this model has never been shown data from B.
The peculiar thing is that if I fit and transform on B directly, instead of using the scaling attributes of A_train, the performance is a lot better. Usually this reduces performance if I test on A_test, but in this scenario it seems to work, although it's not right.
Since I work mostly on climate datasets, training on every dataset is not feasible. Therefore I would like to know the best way to scale such different datasets with the same features to get better performance.
Any ideas, please?
PS: I know training my model with more data can improve performance, but I am more interested in the right way of scaling. I tried removing outliers from the datasets and applied a QuantileTransformer; it improved performance but could be better.
One possible solution could be like this.
Normalize (pre-process) dataset A such that the range of each feature is within a fixed interval, e.g., [-1, 1].
Train your model on the normalized set A.
Whenever you are given a new dataset like B:
(3.1) Normalize the new dataset such that its features have the same range as they have in A ([-1, 1]).
(3.2) Apply your trained model (step 2) on the normalized new set (3.1).
As you have a one-to-one mapping between set B and its normalized version, you can recover the predictions on set B from the predictions on the normalized set B.
Note that you do not need access to set B in advance (or to such sets, even if there are hundreds of them). You normalize each one as soon as you are given it and want to test your trained model on it.
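A short scikit-learn sketch of these steps (the model and variable names are placeholders, not a full solution):
from sklearn.preprocessing import MinMaxScaler

# Step 1: bring every feature of A into [-1, 1]
scaler_A = MinMaxScaler(feature_range=(-1, 1))
A_train_n = scaler_A.fit_transform(A_train)
A_test_n = scaler_A.transform(A_test)

# Step 2: train on the normalized A (model definition omitted)
model.fit(A_train_n, y_A_train)

# Step 3: when a new dataset B arrives, fit a scaler on B itself so its
# features also span [-1, 1], then apply the already-trained model
scaler_B = MinMaxScaler(feature_range=(-1, 1))
B_n = scaler_B.fit_transform(B)
predictions_on_B = model.predict(B_n)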

Financial Time Series Forecasting with Keras/Tensorflow: Three forecasting methods tried, three poor results had, what am I doing wrong?

I'm working with a dataset of a bunch of different features for stock forecasting for a project to help teach myself programming and to get into machine learning. I am running into some issues with predicting future data (or time series forecasting) and I was hoping someone out there could give me some advice! Any advice or criticism you could provide will be greatly appreciated.
Below I've listed detailed examples of the three implementations I have tried for forecasting time series data. I could definitely be wrong on this, but I don't believe this is a mechanical code issue because all of the results are consistent despite me re-coding it a few times (the only thing I can really think of here is not using MinMaxScaler correctly; see closing thoughts). It could definitely, however, be a macro-level coding mistake.
I didn't post any code for the project here because it was starting to turn into a wall of words and I had three separate examples, but if you have any questions or think it would benefit your assistance to see the code used for any of the below examples or the data used for all of them feel free to let me know and I'll link whatever's needed.
The three forecasting implementations I have tried:
1) - A sliding window implementation. Input data is shifted backwards in timesteps (x-1, x-2, ...), and the target data is the current timestep (x). The data used for the first forecast is n rows of test data shifted in the same manner as the input data. For every subsequent prediction, the oldest timestep is removed and the new prediction is appended, maintaining the same total number of timesteps but progressing forward in time (a simplified sketch of this set-up is shown below, after the split description).
2) - Input data is just x, target data is shifted 30 timesteps forward for prediction (y+1, y+2 ... y+30). Attempting to forecast future by taking the first sample of x in test data and predicting 30 steps into the future with it.
3) - Combination of both methods, input data is shifted backward and in the example shown below, 101 timesteps including the present timestep (x-100, x-99 ... x) were used. Target data, similar to implementation 2, is shifted 30 timesteps into the future (y+1, y+2... y+30). With this, I am attempting to forecast the future by taking 101 timesteps of the first n-rows of test data and predicting 30 steps into the future from there.
For all tests, I cut off the end of my dataset at an arbitrary amount (last ~10% of total dataset), split everything before the cutoff into training/validation (80/20) and saved everything after the cutoff for testing & forecasting purposes.
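A simplified sketch of what I mean for implementation 1 (not my exact code; assume series is a 1-D NumPy array of close prices and model is a trained Keras model):
import numpy as np

def make_windows(series, window):
    X, y = [], []
    for t in range(window, len(series)):
        X.append(series[t - window:t])   # past timesteps x-window ... x-1
        y.append(series[t])              # current timestep x as the target
    return np.array(X), np.array(y)

X, y = make_windows(series, window=30)

# iterative forecasting: predict, append the prediction, drop the oldest step
last_window = series[-30:].copy()
forecast = []
for _ in range(30):
    pred = model.predict(last_window[np.newaxis, :, np.newaxis])[0, 0]
    forecast.append(pred)
    last_window = np.append(last_window[1:], pred)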
As for network architectures, I've tried a bunch of different ones, from a bidirectional LSTM to a multi-input CNN/GRU to a WaveNet-like CNN implementation, and all produce prediction results that are bad in a similar enough way that I feel this is either a data-manipulation problem or a problem of me not understanding how model.predict() or my model's output works.
The architectures I will be using for each implementation below are:
1) a causal dilated CNN
2) a two-layer LSTM
neural network architecture diagrams here: https://imgur.com/a/cY2RWNG
For every example below the model's weights were tuned by the model training on the training data (first 80% of dataset) and attempting to achieve the lowest validation loss possible using the validation data (last 20% of dataset).
--- First Implementation ---
(unfortunately, there's an image limit on Stack Overflow at my current reputation, so I've put each implementation into its own album)
Implementation 1 - Graphs for both CNN/LSTM: Images 1-7 https://imgur.com/a/36DZCIf
In every training/validation test graph, black represents the actual data and red the predicted data. In the forecasting predictions, blue represents the predictions made and orange the actual close price on a longer time scale than the prediction, for better context; all forecast predictions are 30 days into the future.
Using this methodology, and displaying the actual close price against the predicted close price in every instance:
Image 1 - sliding window set up for this implementation using one and two features and a range of numbers for viewing ease
CNN:
(images 2 & 3 description in album)
Image 4 - Sliding window approach of forecasting every feature in the data, with the prediction for close price plotted against the actual close price. The timesteps start at the first row of the cutoff data.
When the first prediction is made I append the prediction to the end of this row and remove the first timestep, repeating for every future timestep I wish to predict.
I really don't even know what to say about this prediction, it's really bad...
LSTM:
(images 5 & 6 description in album)
Image 7 - Sliding window prediction: https://i.imgur.com/Ywf6xvr.png
This prediction seems to be getting the trend somewhat accurately, I guess... but the starting point is nowhere near the last known data point, which is confusing.
--- Second Implementation ---
Implementation 2 - Graphs for both CNN/LSTM: Images 1-7
https://imgur.com/a/3CAk1xc
For this attempt, I made the target prediction many timesteps into the future. With this implementation, the model takes in the current timestep(x) of features and attempts to predict the closing price at y+1, y+2,y+3 etc. There is only one prediction here -- a sequence of time steps into the future.
The same graphing and network conventions as implementation 1 had applied to this too.
Image 1 - Set up of input and target data, using a range and only one or two features for viewing ease.
CNN:
(images 2 & 3 description in album)
Image 4 - Plotting all 30 predictions made from the first row of data features after the cutoff... this is horrible. Why is the start again nowhere near the last known data point? I don't understand how it can predict y+1 being so far away from the closing price of x when, in every instance of its training, y+1 was almost certainly extremely close to x.
LSTM:
(images 5 & 6 description in album)
Image 7 - All 30 predictions into the future made from the first row of cutoff data: Again, both all over the place and the predictions start nowhere near the last actual data point, not sure what else to add.
It's starting to appear that either my CNN implementation is poorly done or LSTM is just a better solution here. Regardless, the predictions and actual forecasting are still terrible so I'll hold my judgment on the network architecture until I get something that looks remotely like an actual forecast prediction.
--- Third Implementation ---
Implementation 3 - Graphs for both CNN/LSTM: Images 1-7
https://imgur.com/a/clcKFF8
This was the final idea I had for forecasting the future and it's essentially a combination of the first two. For this implementation, I take x-n timesteps (x, x-1, x-2, x-3, etc.), which is similar to the first implementation, and set the target data to y+1, y+2, y+3, which is similar to the second implementation. My goal for predicting with this was the same strategy as the second implementation, where I would predict 30 days into the future, but instead of doing so on one timestep of features, I'd do so on many timesteps into the past. I had hoped that this implementation would give the prediction enough supporting data to accurately forecast the future.
Image 1 - Input data or "x" and Target data or "y" implementation and set up. I use a range of numbers again. In this example, the input data has 2 features, includes the present timestep (x) and 4 timesteps shifted backward (x-1, x-2, x-3, x-4) and the target data has 5 timesteps into the future (y+1, y+2, y+3, y+4, y+5)
CNN:
(images 2 & 3 description in album)
Image 4 - 30 predictions into the future using 101 timesteps of x
This is probably the worst result yet, and that's despite the prediction having way more past timesteps of data to use.
LSTM:
(images 5 & 6 description in album)
Image 7 - 30 predictions on input data row of 101 timesteps.
This actually has some action to it I guess, but it's all over the place, doesn't start near the last actual data point and is clearly not accurate at all.
--- Closing Thoughts ---
I've also tried removing the target variable (close price) from the input data but it doesn't seem to change much and the past n-days of closing data should be available to a model anyway I would think.
Originally I MinMax-scaled all of my data in my pre-processing stage and did not inverse_transform any of the data. The results were basically just as bad as the examples above. For the examples above, I have MinMax-scaled the prediction, validation & test datasets separately to be within the range 0.2 - 0.8. For the actual forecasting predictions, I've inverse_transformed the data before plotting it against the actual closing price, which was never transformed.
If I am doing something fundamentally wrong in the above examples I would love to know as I'm just starting out and this is my first real programming/machine learning project.
A few other things relating to this that I've come across / tried:
I've experimented briefly with using a stateful model where I reset_states() after every prediction to some moderate success.
I've read that sequence to sequence models could be useful for forecasting time series data but I'm really not sure what that system is designed to do with time series despite reading into it quite a bit and am thus not sure how to implement it or test it out.
I tried bidirectional LSTM because of one random StackOverflow post suggesting it for time series forecasting... the results were very mediocre however and it doesn't seem to make much sense to me in this situation from what I understand of how it works. I've only tried it with the first implementation above though, let me know if it's something to look more into.
Any tips/criticism at all that you could provide would be greatly appreciated, I'm really not sure how to progress from here. Thanks in advance!
I have been through that. For me, the sliding window approach with LSTM/NN worked like magic for small time series, but on a bigger time series, with data coming in on an hourly basis for a few years, it failed miserably.
Later on I ditched LSTMs and GBTs and started implementing algorithms from statsmodels.tsa, ARIMA and SARIMA most of the time; I suggest you read about them too. They are very easy to implement: no need to worry about sliding windows or moving data a few timestamps back, it takes care of all of that. Just train, tune the parameters and predict for the next timestamps.
Sometimes I also faced issues where my time series had missing timestamps and data, so I had to impute those values; or the frequency on which I trained (hourly, weekly, monthly) was different from the frequency at which I wanted to predict, so I had to bring the data into the right form too. I also ran into the frequency mismatch while visualising on a plot.
import statsmodels.api as sm
model = sm.tsa.SARIMAX(train_df, order=(1, 0, 1), seasonal_order=(1, 1, 0, 24))
results = model.fit()
Other than the data pre-processing part (imputing missing data, training on the right frequency and some logic for parameter tuning), you will need to use just these couple of lines; your dataframe should have its index in a date format and the time series data in its columns.
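For context, a minimal end-to-end sketch along those lines (the file name, column name and hourly frequency are assumptions, not from the original post):
import pandas as pd
import statsmodels.api as sm

train_df = pd.read_csv("prices.csv", index_col="date", parse_dates=True)  # hypothetical file
train_df = train_df.asfreq("H")                          # regular hourly frequency
train_df["close"] = train_df["close"].interpolate()      # impute missing values

model = sm.tsa.SARIMAX(train_df["close"], order=(1, 0, 1), seasonal_order=(1, 1, 0, 24))
results = model.fit()

forecast = results.forecast(steps=30)                    # predict the next 30 timestamps
print(forecast)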

Formatting and combining word frequency with other data machine learning python

I'm new to machine learning algorithms. I extensively read the scikit-learn website and other SO posts, which led me to build my first machine learning model using RandomForestClassifier and LinearSVC.
I'm working on medical notes. Each stay of a patient is associated (or not) with a code corresponding to a complication (bleeding, infection, heart attack...).
Using the notes, fitted and transformed with CountVectorizer and TfidfTransformer, I can accurately predict most of the codes. However, I'd like to add more data to my training dataset: length of stay, number of operations, title of operations, ICU stay duration, etc.
After scouring the web and SO, I ended up adding all continuous/binary/scaled values to my word frequency array,
e.g. [0, 0, 0.34, 0, 0.45, 0, 2, 45] (the last 2 numbers are added data, whereas the previous ones come from CountVectorizer and tfidf.fit_transform(train_set)).
However, this seems to me to be a crude way to combine data, and the huge number of word features could mask the other data.
I tried to set my data up like [[0, 0, 0.34, 0, 0.45, 0], [2], [45]] but it doesn't work.
I searched the web but found no real clue, even though I might not be the first one facing this issue... :p
Thanks for your help
Edit:
Thanks for your detailed, valuable answer. I really appreciate it. However, what exactly is the range 0-1: is it the predict_proba value (http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.predict)? I understood that the score is the accuracy of the prediction model. Then, when you have all your predictions depending on each variable, do you average all of them? Finally, I'm working with multiple outputs; I guess it's not a problem since I can get a prediction for each of the outputs (btw, predict_proba(X) gives me an array like [array([[0., 1.]]), array([[0.2, 0.8]]), ...] with a random forest classifier. I guess one of the numbers is the probability of the output, but I haven't explored this yet!)
Your first solution of just appending to the list is the correct solution. However, you should think about what this is implying. If you have 100 words and add two additional features, each specific word will get the same "weight" as the added features, i.e., your added features won't be treated very strongly in the model. Additionally, you're saying that the last feature with a value of 45 is 100x the value of the feature fourth from the end (0.45).
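For reference, a rough sketch of that appending done with scipy.sparse.hstack, so the extra columns sit next to the sparse tf-idf matrix (the column names here are made up for illustration):
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X_words = vectorizer.fit_transform(train_notes)           # sparse word-frequency matrix

extra = train_df[["length_of_stay", "n_operations"]].to_numpy()
X = hstack([X_words, csr_matrix(extra)])                  # one combined feature matrix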
One common way to get around that is to use an ensemble model. Instead of adding those features to your list of words and predicting, first build a prediction model just using the words. That prediction will be in the range 0-1 and will capture the "sentiment" of the article. Then, scale your other variables (minmax scaler, normal distribution, etc.). Finally, combine the score from the words with the last two scaled variables and run another prediction on a list like this [.86,.2,.65]. In this way, you have transformed all of the words to a sentiment score, which you can use as a feature.
Hope that helps.
EDIT PER YOUR UPDATE ABOVE
Yes, in this instance you could use predict_proba, but really, if everything is scaled correctly and you are using 1/0 as your targets for a class, you don't need predict_proba. The idea is to take the prediction from the words and combine it with the other variables. You do not average the predictions, you make a prediction from the predictions! This is called ensemble learning. Train another model with the output of your predictions as the features. Here is a flow of what you need to do.
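A rough sketch of that two-stage flow (the model choices and variable names are illustrative assumptions, not a prescription):
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

# stage 1: a model trained only on the word features produces a 0-1 score
word_model = LogisticRegression(max_iter=1000).fit(X_words_train, y_train)
word_score = word_model.predict_proba(X_words_train)[:, 1]

# scale the other variables into a comparable range
scaler = MinMaxScaler()
other_scaled = scaler.fit_transform(other_features_train)

# stage 2: combine the word score with the scaled variables, e.g. [.86, .2, .65],
# and train a second model on those stacked features
X_stage2 = np.column_stack([word_score, other_scaled])
final_model = RandomForestClassifier().fit(X_stage2, y_train)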
Thanks for your time and your detailed answer. I think I get it. In short:
Prediction based on words: for each bag of words of the training set (t1), you pull out a "sentiment".
Create a new array for each training set row with the sentiment and the other values -> new training set (t2).
Make a prediction based on t2.
Apply the previous steps to the test set.
One more question though !
What is the "sentiment" value?! For each bag of words, I have a sparse matrix (CountVectorizer + tf-idf). So how do you calculate the sentiment? Do you run each row of the test against the rest of the test? And is your sentiment the clf.predict(X) value?

Connecting 2 line plots

Hey all, I was wondering how I could connect 2 different line plots on the same graph together in matlab?
If that is not ideal, then I could combine the 2 dataframes; however, I would then need a way to tell it to change the color of the line plot at a certain x point.
I want to indicate where the predicted sales are on the graph. Here is a picture of what my code and graph currently look like (red is actual, green is predicted).
Here is the link to my ipython notebook https://github.com/neil90/Learning_LinearReg/blob/master/Elantra%20Regression_Practice.ipynb
My original dataset was 50 observations (small, I know), which I split into training and test sets. I got an R² of .72 on my test set. So then I looked online to see if I could find the independent variables for the 12 months after the dataset and, lo and behold, I was able to (although I am not sure of their accuracy). I then wanted to predict the sales with my model, hence the green line.
It is always possible to connect two points using a single plot command, both in MATLAB and in Python, such as:
P1 = df1(:,end); % last Point in The First Plot
P2 = df2(:,1); % first Point in The Second Plot
plot([P1(1,1) P2(1,1)],[P1(2,1) P2(2,1)])
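Since the linked notebook is in Python, a matplotlib equivalent might look like this (the dataframe and column names are made up):
import matplotlib.pyplot as plt

plt.plot(df_actual["month"], df_actual["sales"], color="red")     # actual sales
plt.plot(df_pred["month"], df_pred["sales"], color="green")       # predicted sales

# bridge the gap with a segment from the last actual point to the first predicted one
plt.plot([df_actual["month"].iloc[-1], df_pred["month"].iloc[0]],
         [df_actual["sales"].iloc[-1], df_pred["sales"].iloc[0]],
         color="green")
plt.show()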
But this is not a good practice; as you said, the green graph is your prediction part. Why don't you include more inputs when calculating the predicted points?
So I assume that you have a set of data, and that you divided it into two parts, one for training and the other for testing the learned model.
The ratio should be around 70/30 or 80/20, unless you have a validation part as well.
After you train your model on the training part of the data, you should check the error of the model (does it converge?). As far as I can see, your prediction part has a huge error: it cannot even re-generate the samples it has seen before (i.e. the last point on the red graph).
So first try to plot the prediction over the whole training data, to see how accurate the learned model is. I am pretty sure that you need to reduce the learning error for your model by doing more iterations or changing the structure of the model.
Because you have not mentioned enough details, most of the ideas here are based on assumptions.
