Train machine learning model with scikit learn for time-series prediction - python

I need to train a model with scikit-learn to predict possible time for less people in a room.
Here is how my dataset looks like:
Time PeopleCount
---------------------------------------------
2019-12-29 12:40:10 50
2019-12-29 12:42:10 30
2019-12-29 12:44:10 10
2019-12-29 12:46:10 10
2019-12-29 12:48:10 80
and so on...
This data will be available for 30 days.
Once the model is trained, I will query the model to get the possible time when there will be fewer people in the room between 10.AM and 8.PM. I expect the machine learning model to respond back with the 30-minute accuracy, ie. "3.00 PM to 3.30PM"
What algorithm can I use for this problem and how can I achieve the goal? Or are there any other Python libraries than SciKit-Learn which can be used for this purpose?
I am new to machine learning, sorry for a naive question.

First of all, time-series prediction is on the base of theory that current value more or less depend on the past ones. For instance, 80 of people count as of 2019-12-29 12:48:10 has to be strongly influenced on the people count at the time of 12:46:10, 12:44:20 or previous ones, correlating with past values. If not, you would be better off using the other algorithm for prediction.
While the scikit package contains a various modules as the machine learning algorithm, most of them specialize in the classification algorithm. I think the classification algorithm certainly satisfy your demand if your date is not identified as the type of time series. Actually, scikit also has some regression module, even though I think that seem not to be well suitable for prediction of time series data.
In the case of prediction of time series data, RNN or LSTM algorithm (Deep Learning) has been widely utilized, but scikit does not provide the build-in algorithm of it. So, you might be better off studying Tensorflow or Pytorch framework which are common tools to be enable you to build the RNN or LSTM model.

SciKitLearn models do not recognize timestamps, so you will have to break down your timestamp column into a number of features, ie. day of week, hour, etc. If you need 30-minute accuracy then you will have to aggregate your data from the PeopleCount column somehow, ie. record average, minimum or maximum number of people within each 30-minute time interval. It may be a good idea to also create lagged features, ie. what was the people count in a previous time slot or even 2, 3 or n time slots ago.
Once you have you have your time features and labels (corresponding people counts) ready you can start training your models in standard way:
split your data into training and validation sets,
train each model that you want to try and compare the results.
Any regressor should be suitable for this task, ie. Ridge, Lasso, DecisionTreeRegressor, SVR etc. Note however that if you need to get the best time slot from the given range you will need to make predictions for every slot from the range and pick the one which fits the criteria, although there may be cases where the smallest predicted value is not smaller then value you compare it with.
If you do not get satisfying results with regressors, ie. every time the mean or median squared errors are too high, you could come up with a classification case, ie. instead of training a regressor to predict the number of people you can train a classifier to predict whether the count is greater than 50 or not.
There are many ways to approach this problem. Once try different models and examine the results you will come up with ways to optimize the parameters, engineer features, pre-process the input etc.

Related

Anomaly Testing - Linear Regression with t or not with t? Problems to understand the setup

If you want to check an anomaly in stock data many studies use a linear regression. Let's say you want to check if there is a Monday effect, meaning that monday is significantly worse than other days.
I understood that we can use a regression like: return = a + b DummyMon + e
a is the constant, b the regression coefficient, we have the Dummy for Monday and the error term e.
That's what I used in python:
First you add a constant to the anomaly:
anomaly = sm.add_constant(anomaly)
Then you build the model:
model = sm.OLS(return, anomaly)
The you fit the model:
results = model.fit()
I wonder if this is the correct model setup.
In this case a plot of the linear regression would just show two vertical areas above 0 (for no Monday) and 1 (for Monday) with all the returns. It looks pretty strange. Is this correct?
Should I somehow try to use the time (t) in the regression? If so, how can I do it with python? I thought about giving each date an increasing number, but then I wondered how to treat weekends.
I would assume that with many data points both approaches are similar, if the time series is stationary, right? In the end I do a cross section anaylsis and don't care about the aspect of the time series in this case, correct? ( I heard about GARCH models etc, where this is a different)
Well, I am just learning and hope someone could give me some ideas about the topic.
Thank you very much in advance.
For time series analysis tasks (such as forecasting or anomaly detection), you may need a more advanced model, such as Recurrent Neural Networks (RNN) in deep learning. You can assign any time step to an RNN Cell, in your case, every RNN Cell can represent a day or maybe an hour or half a day etc.
The main purpose of the RNNs is to make the model understand the time dependencies in the data. For example, if monday has a bad affect, then corresponding RNN Cells will have reasonable parameters. I would recommend you to do some further research about it. Here there are some documentations that may help:
https://colah.github.io/posts/2015-08-Understanding-LSTMs/
(Also includes different types of RNN)
https://towardsdatascience.com/understanding-rnn-and-lstm-f7cdf6dfc14e
And you can use tensorflow, keras or PyTorch libraries.

How can you do time series forecasting in Tensorflow (or with other tools) where features of the label timestep are known?

This is a question about a general approach rather than a specific coding problem. I'm trying to do time series forecasting with Tensorflow where features of the label timestep are known to the model. E.g. a human trying to predict a variable a week from now would know things that are going to happen in the next week that will affect that variable. So a window of 20 timesteps where the label is the 20th timestep would look something like this:
Timesteps 1-19 would each have a set of features plus the timeseries data
Timestep 20 would have a set of features which are known, plus the timeseries label which is unknown
Is there a model that could handle this sort of data? I've gone through the Tensorflow time series forecasting tutorial, done a Coursera course on Tensorflow time series forecasting and searched elsewhere but I can't find anything. I'm fairly new to this so apologies for any imprecise language.
I once tried to do this kind of TS problem by stacking a multivariate model and another machine learning model. My idea was that I use the normal TS model's output, add it as another feature in the other model that only takes the last time step's info as input. But it is complicated and might overfit a lot even if I carefully regularized the second model. The idea is that I use step 1 to window_size - 1 info to predict a rough output at step window_size, then use the info at step window_size to reduce the residual between my TS model output and the actual label; But I don't think this approach is theoretically correct and the result might be worse than using a TS model without feeding the target step's info.
I don't think tensorflow have any API for your problem because this type of problem is not a normal TS problem. Usually people would just treat this kind of problem as a regression or classification problem.
I am not an expert on this problem as well, but I just happened to attempt to solve the exact problem so this is just my personal experience...

Classifier for time based data to binary label

I have access to a dataframe of 100 persons and how they performed on a certain motion test. This frame contains about 25,000 rows per person since the performance of this person is kept track of (approximately) each centisecond (10^-2). We want to use this data to predict a binary y-label, that is to say, if someone has a motor problem or not.
Trained neural networks on mean's and variances of certain columns per person classified +-72% of the data correctly.
Naive bayes classifier on mean's and variances of certain columns per person classified +-80% correctly.
Now since this is time based data, 'performance on this test through time', we were suggested to use Recurrent Neural Networks. I've looked into this and I find that this is mostly used to predict future events, i.e. the events happening in the next centiseconds.
Question is, is it in general feasible to use RNN's on (in a way time-based) data like this to predict a binary label? If not, what is?
Yes it definitely is feasible and also very common. Search for any document classification tasks (e.g. sentiment) for examples of this kind of tasks.

Would a Logistic Regression Machine Learning Model Work Here?

I am in 10th grade and I am looking to use a machine learning model on patient data to find a correlation between the time of week and patient adherence. I have separated the week into 21 time slots, three for each time of day (1 is Monday morning, 2 is monday afternoon, etc.). Adherence values will be binary (0 means they did not take the medicine, 1 means they did). I will simulate training, validation and test data for my model. From my understanding, I can use a logistic regression model to output the probability of the patient missing their medication on a certain time slot given past data for that time slot. This is because logistic regression outputs binary values when given a threshold and is good for problems dealing with probability and binary classes, which is my scenario. In my case, the two classes I am dealing with are yes they will take their medicine, and no they will not. But the major problem with this is that this data will be non-linear, at least to my understanding. To make this more clear, let me give a real life example. If a patient has yoga class on Sunday mornings, (time slot 19) and tends to forget to take their medication at this time, then most of the numbers under time slot 19 would be 0s, while all the other time slots would have many more 1s. The goal is to create a machine learning model which can realize given past data that the patient is very likely going to miss their medication on the next time slot 19. I believe that logistic regression must be used on data that still has an inherently linear data distribution, however I am not sure. I also understand that neural networks are ideal for non-linear distributions, but neural networks require a lot of data to function properly, and ideally the goal of my model is to be able to function decently with simply a few weeks of data. Of course any model becomes more accurate with more data, but it seems to me that generally neural networks need thousands of data sets to truly become decently accurate. Again, I could very well be wrong.
My question is really what model type would work here. I do know that I will need some form of supervised classification. But can I use logistic regression to make predictions when given time of week about adherence?
Really any general feedback on my project is greatly appreciated! Please keep in mind I am only 15, and so certain statements I made were possibly wrong and I will not be able to fully understand very complex replies.
I also have to complete this within the next two weeks, so please do not hesitate to respond as soon as you can! Thank you so much!
In my opinion a logistic regression won't be enough for this as u are going to use a single parameter as input. When I imagine a decision line for this problem, I don't think it can be achieved by a single neuron(a logistic regression). It may need few more neurons or even few layers of them to do so. And u may need a lot of data set for this purpose.
It's true that you need a lot of data for applying neural networks.
It would have been helpful if you could be more precise about your dataset and the features. You can also try implementing K-Means-Clustering for your project. If your aim is to find out that did the patient took medicine or not then it can be done using logistic regression.

Tensorflow: predict grow factor in a time series forecast

I am developing an application to predict future hourly online orders on my e-commerce website (time-series problem) using Canned Estimator tf.estimator.DNNRegressor
estimator = tf.estimator.DNNRegressor(
feature_columns=my_feature_columns,
hidden_units=hidden_units,
model_dir=model_dir,
optimizer=tf.train.ProximalAdagradOptimizer(learning_rate=0.01,
l1_regularization_strength=0.001))
The features I am using are pretty much based on the date and time. For example, the csv file from my training data looks like this
year,month,day,weekday,isweekend,hr,weeknum,yearday,orders
2018,7,16,2,0,0,29,197,193
2018,7,16,2,0,1,29,197,131
2018,7,16,2,0,2,29,197,77
2018,7,16,2,0,3,29,197,59
.....
where orders column is the target for the model.
The model I got so far is working decently but when I run predictions for a high demand day like Black Friday, it is under-predicting. For example, in the graph below we can see that predictions for Black Friday this year 2018 (dashed line) are not as high as we intuitively expect, even though it predicts the shape nicely.
With that all being said, I would appreciate any recommendation to add to my model so it can also predict correctly the grow factor and not only the trend.
This is a time series problem, so you're better off using tf.contrib.timeseries.ARRegressor (neural network built specifically for time series) or tf.contrib.timeseries.StructuralEnsembleRegressor (time series state space model - which ) than a generic neural network.
Both models take an exogenous_feature_columns argument, you could populate that with 0 for normal days and 1 for event days like Black Friday. That would fix your under-predicting problem since otherwise the model would treat those spikes as outliers (you could do this even with a generic neural network - it's just easier to code with the time series specific functions).
On a more general note, I would recommend other tools besides tensorflow for time series forecasting, such as Facebook Prophet or Statsmodels package.
I would go further and recommend that you don't use Python at all, and instead look at using some of the forecasting packages available in R.

Categories