I'm currently exploring the use of Random Forests to predict future values of occurrences (my ARIMA model gave me really bad forecasts, so I'm trying to evaluate other options). I'm fully aware that the bad results might be due to the fact that I don't have a lot of data and its quality isn't the greatest. My initial data consisted simply of the number of occurrences per date. I then added separate columns for the day, month, year, and day of the week (which was later one-hot encoded), as well as two columns with lagged values (one with the value observed the day before and another with the value observed two days before). The final data looks like this:
Count Year Month Day Count-1 Count-2 Friday Monday Saturday Sunday Thursday Tuesday Wednesday
196.0 2017.0 7.0 10.0 196.0 196.0 0 1 0 0 0 0 0
264.0 2017.0 7.0 11.0 196.0 196.0 0 0 0 0 0 1 0
274.0 2017.0 7.0 12.0 264.0 196.0 0 0 0 0 0 0 1
286.0 2017.0 7.0 13.0 274.0 264.0 0 0 0 0 1 0 0
502.0 2017.0 7.0 14.0 286.0 274.0 1 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ...
I then trained a random forest with the count as the label (what I'm trying to predict) and all the rest as features. I also made a 70/30 train/test split, trained on the train data, and used the test set to evaluate the model (code below):
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=1000, random_state=42)
rf.fit(train_features, train_labels)       # fit on the 70% training split
predictions = rf.predict(test_features)    # predict on the 30% test split
The results I obtained were pretty good: MAE=1.71 and Accuracy of 89.84%.
First question: is there any possibility that I'm crazily overfitting the data? I just want to make sure I'm not making some big mistake that's giving me better results than I should get.
Second question: with the model trained, how do I use RF to predict future values? My goal was to give weekly forecasts for the number of occurrences, but I'm kind of stuck on how to do that.
If someone who's a bit better and more experienced than me at this could help, it would be very much appreciated! Thanks
Addressing your first question: random forests can tend to overfit, but that should be checked by comparing the MAE, MSE, and RMSE on your test set against those on your training set. What do you mean by accuracy? Your R squared? In any case, a common workflow is to let the model overfit at first, so you have a decent accuracy/MSE/RMSE, and later apply regularization techniques to deal with the overfitting, e.g. a low max_depth or a high min_samples_leaf (min_child_weight is the analogous knob in gradient-boosting libraries); a high n_estimators is also good.
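For reference, a minimal sketch of what that regularized refit might look like with scikit-learn's RandomForestRegressor (the specific values are placeholders to tune, not recommendations):

from sklearn.ensemble import RandomForestRegressor

# Shallower trees and larger leaves reduce variance; the values below are
# placeholders to tune (e.g. via cross-validation), not recommendations.
rf_regularized = RandomForestRegressor(
    n_estimators=1000,
    max_depth=8,
    min_samples_leaf=5,
    random_state=42,
)
rf_regularized.fit(train_features, train_labels)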
Secondly, to use your model to predict future values, you use the exact same trained model on the dataset you want predictions for. Of course, the features given at training time must match the inputs that will be given when forecasting. Furthermore, keep in mind that as time passes, the new observations will be very valuable for improving your model: add them to your training dataset and retrain.
forecasting = rf.predict(dataset_to_be_forecasted)
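To make the weekly forecast concrete: because Count-1 and Count-2 are lagged values, predicting more than one day ahead means feeding each day's prediction back in as the lag for the following day. A rough sketch, assuming df is the full dataframe shown above and build_feature_row is a hypothetical helper that assembles Year, Month, Day, the two lags, and the one-hot weekday columns in the same order used for training:

import pandas as pd

history = list(df['Count'])          # all counts observed so far
forecasts = []
# example week to forecast; replace with the dates you actually need
for day in pd.date_range(start='2017-08-01', periods=7):
    # build_feature_row is hypothetical: it must return the features in the
    # exact column order used when fitting rf
    row = build_feature_row(day, count_lag1=history[-1], count_lag2=history[-2])
    pred = rf.predict([row])[0]
    forecasts.append(pred)
    history.append(pred)             # the prediction becomes the next day's lag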
Related
I have a dataframe which is composed of a timestamp and two variables:
A pressure measurement, which varies sequentially and characterizes a specific process batch;
A lab analysis value, a measurement associated with each batch. The analysis always occurs at the end of the batch, and the value stays constant until a new analysis is made. Caution: not every batch is analyzed, and I don't have a flag indicating when a batch started.
I need to create a dataframe which contains, for each batch, the average, maximum, and minimum pressure, and how long it took from start to end (a timedelta).
I had an idea to loop through all analysis values from the end to the start, and every time I find a new analysis value OR the pressure drops below a certain value (a characteristic of the process: all batches start with low pressure), I'd consider that the batch start (to calculate the timedelta and to define the interval over which to take the pressure min, max, and average).
However, I know it is not efficient to loop through all dataframe rows (especially with 1 million rows), so, any ideas?
Dataset sample: https://cl1p.net/5sg45sf5 or https://wetransfer.com/downloads/321bc7dc2a02c6f713963518fdd9271b20201115195604/08c169
Edit: there is no clear/ready indication of when a batch starts in the current data (as someone asked), but you can identify a batch by the following characteristics:
Every batch starts with the pressure below 30 and rising quickly (in less than one hour) up to 61.
Then it stabilizes around 65 (the plateau value can be something between 61 and 70) and stays there for at least 2 and a half hours.
It ends with a pressure drop (faster than one hour) to a value smaller than 30.
The cycle repeats.
Note: there can be smaller/shorter peaks between two valid batches, but these should not be considered batches.
Thanks!
This solution assumes that the batches change when the value of lab analysis changes.
First, I'll plot the lab analysis values, so we can get an idea of how frequently they change:
df['lab analysis'].plot()
There are not many changes, so we just need to identify these:
diff = df['lab analysis'].diff()
df_shift = df.loc[(diff != 0) & diff.notna()]
df_shift
time pressure lab analysis
2632 2020-09-15 19:52:00 356.155 59.7
3031 2020-09-16 02:31:00 423.267 59.4
3391 2020-09-16 08:31:00 496.583 59.3
4136 2020-09-16 20:56:00 625.494 59.4
4971 2020-09-17 10:51:00 469.114 59.2
5326 2020-09-17 16:46:00 546.989 58.9
5677 2020-09-17 22:37:00 53.730 59.0
6051 2020-09-18 04:51:00 573.789 59.2
6431 2020-09-18 11:11:00 547.015 58.7
8413 2020-09-19 20:13:00 27.852 58.5
10851 2020-09-21 12:51:00 570.747 58.9
15816 2020-09-24 23:36:00 553.846 58.7
Now we can loop over these few change points, label each batch, and then compute the descriptive statistics:
index_shift = df_shift.index

i = 0
batch = 1
for shift in index_shift:
    # every row between two consecutive change points belongs to one batch
    df.loc[i:shift, 'batch number'] = batch
    batch = batch + 1
    i = shift
stats = df.groupby('batch number')['pressure'].describe()[['mean','min','max']]
Then compute the time difference for each batch and insert it into stats as well:
df_shift.loc[0] = df.iloc[0, :3]                         # prepend the very first row as the start of batch 1
df_shift.sort_index(inplace=True)
time_difference = df_shift['time'].diff().tolist()[1:]   # elapsed time between consecutive change points
stats['duration'] = time_difference
stats
mean min max duration
batch number
1.0 518.116150 24.995 671.315 1 days 19:52:00
2.0 508.353105 27.075 670.874 0 days 06:39:00
3.0 508.562450 26.715 671.156 0 days 06:00:00
4.0 486.795097 25.442 672.548 0 days 12:25:00
5.0 491.437620 24.234 671.611 0 days 13:55:00
6.0 515.473651 29.236 671.355 0 days 05:55:00
7.0 509.180860 25.566 670.714 0 days 05:51:00
8.0 490.876639 25.397 671.134 0 days 06:14:00
9.0 498.757555 24.973 670.445 0 days 06:20:00
10.0 497.000796 25.561 670.667 1 days 09:02:00
11.0 517.255608 26.107 669.476 1 days 16:38:00
12.0 404.859498 20.594 672.566 3 days 10:45:00
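As a side note, since the original concern was looping over ~1 million rows: the batch labels themselves can also be assigned without an explicit Python loop by taking a cumulative sum over the change mask. A sketch, under the same assumption that a batch ends when the lab analysis value changes:

# True wherever the lab analysis value changes; cumsum turns that mask into an
# increasing batch counter (unlike the loop above, rows after the last change
# end up in a final, possibly incomplete, batch)
change = df['lab analysis'].diff().fillna(0) != 0
df['batch number'] = change.cumsum() + 1
stats = df.groupby('batch number')['pressure'].agg(['mean', 'min', 'max'])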
I have a Pandas DataFrame of timestamps.
0 2020-08-01 23:59:59
1 2020-08-01 23:59:49
2 2020-08-01 20:52:17
3 2020-08-01 19:02:34
4 2020-08-01 18:38:06
I want to add a column that indexes the rows by cluster, e.g. as follows:
0 2020-08-01 23:59:59 1
1 2020-08-01 23:59:49 1
2 2020-08-01 20:52:17 2
3 2020-08-01 19:02:34 3
4 2020-08-01 18:38:06 3
I have written the following for this example; as we can see, 3 clusters can be made by grouping the nearest/closest timestamps.
from sklearn.cluster import KMeans
mat = df['datetime'].values
kmeans = KMeans(n_clusters=3)
kmeans.fit(mat.iloc[:,1:])
y_kmeans = kmeans.predict(mat.iloc[:,1:])
df['cluster'] = y_kmeans
However, the above code didn't work either. I have millions of rows and obviously don't know how many clusters I should make. I read that the Elbow Method can be used, but I'm not exactly sure how. Can someone explain how it can be done?
KMeans assumes that you know the number of clusters in advance.
If you want a method that determines the number of clusters algorithmically, you can e.g. use DBSCAN, which forms a cluster whenever a group of data points is "close" to each other (with closeness determined by the eps parameter). If you have a large number of samples and this becomes too costly, you can also explore the clusters on a smaller (representative) subset of the data.
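For example, a minimal DBSCAN sketch on the timestamps from the question; eps is in seconds here, and the value 3600 is only an assumption meaning "timestamps less than an hour apart belong to the same cluster":

import pandas as pd
from sklearn.cluster import DBSCAN

df = pd.DataFrame({'datetime': pd.to_datetime([
    '2020-08-01 23:59:59', '2020-08-01 23:59:49', '2020-08-01 20:52:17',
    '2020-08-01 19:02:34', '2020-08-01 18:38:06'])})

# DBSCAN needs a 2-D numeric array, so convert the timestamps to epoch seconds
X = (df['datetime'].astype('int64') // 10**9).to_numpy().reshape(-1, 1)

# eps=3600: points within an hour of each other fall in the same cluster;
# min_samples=1 means no point is treated as noise
df['cluster'] = DBSCAN(eps=3600, min_samples=1).fit_predict(X)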
The sample dataset contains the location points of a user.
df.head()
user tslot Location_point
0 0 2015-12-04 13:00:00 4356
1 0 2015-12-04 13:15:00 4356
2 0 2015-12-04 13:30:00 3659
3 0 2015-12-04 13:45:00 4356
4 0 2015-12-04 14:00:00 8563
df.shape
(576,3)
The location points look random, and I need to predict the user's next location point for a given time. Since the location points are arbitrary numbers, I would like to predict a set of candidate location points for each time slot.
Example:
If I need to predict the location point at tslot 2015-12-04 14:00:00, my predicted output should be [8563, 4356, 3659, 5861, 3486].
My code
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

time_steps = 1
data_dim = X_train.shape[2]

model = Sequential()
model.add(LSTM(data_dim, input_shape=(time_steps, data_dim), activation='relu'))
model.add(Dense(data_dim, activation='softmax'))  # softmax so categorical_crossentropy receives probabilities
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.fit(X_train, y_train, epochs=20, batch_size=96)
model.summary()
which only predicts one location point for each time slot. I would like to know if predicting a set of location points is possible, and how?
I assume that this is for gaining some confidence about the predictions.
If this is the case, there are multiple ways to do this. For example, refer to this paper by Amazon on how to predict quantiles, and this paper on how to use a Bayesian framework to obtain uncertainty around the predictions.
If you have other intentions, please clarify.
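If the intention is simply to output a set of the k most likely locations from the model above (rather than calibrated uncertainty), one hedged option is to take the top-k scores instead of only the arg-max. X_test and location_ids (the mapping from output index back to location point) are assumptions here:

import numpy as np

# scores for one time slot, shape (num_locations,)
probs = model.predict(X_test[:1])[0]

k = 5
top_k_idx = np.argsort(probs)[-k:][::-1]                    # k highest-scoring indices
predicted_locations = [location_ids[i] for i in top_k_idx]  # map indices back to location points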
I am working on an interesting machine learning project about the NYC taxi data (https://s3.amazonaws.com/nyc-tlc/trip+data/green_tripdata_2017-04.csv); the target is predicting the tip amount. The raw data looks like this (2 data samples):
VendorID lpep_pickup_datetime lpep_dropoff_datetime store_and_fwd_flag \
0 2 2017-04-01 00:03:54 2017-04-01 00:20:51 N
1 2 2017-04-01 00:00:29 2017-04-01 00:02:44 N
RatecodeID PULocationID DOLocationID passenger_count trip_distance \
0 1 25 14 1 5.29
1 1 263 75 1 0.76
fare_amount extra mta_tax tip_amount tolls_amount ehail_fee \
0 18.5 0.5 0.5 1.00 0.0 NaN
1 4.5 0.5 0.5 1.45 0.0 NaN
improvement_surcharge total_amount payment_type trip_type
0 0.3 20.80 1 1.0
1 0.3 7.25 1 1.0
There are five different 'payment_type' values, indicated by the numbers 1, 2, 3, 4, 5.
I find that 'tip_amount' is only meaningful when 'payment_type' is 1; payment types 2, 3, 4, and 5 all have zero tip:
for i in range(1, 6):
    print(raw[raw["payment_type"] == i][['tip_amount', 'payment_type']].head(2))
gives:
tip_amount payment_type
0 1.00 1
1 1.45 1
tip_amount payment_type
5 0.0 2
8 0.0 2
tip_amount payment_type
100 0.0 3
513 0.0 3
tip_amount payment_type
59 0.0 4
102 0.0 4
tip_amount payment_type
46656 0.0 5
53090 0.0 5
First question: I want to build a regression model for 'tip_amount'. If I use 'payment_type' as a feature, can the model automatically handle this kind of behavior?
Second question: we know that 'tip_amount' is actually not zero for payment types 2, 3, 4, and 5, it just isn't recorded correctly. If I drop these data samples and only keep 'payment_type' == 1, then on an unseen test dataset the model cannot predict zero tip for payment types 2, 3, 4, and 5, so I have to keep 'payment_type' as an important feature, right?
Third question: let's say I keep the samples for all payment types and the model is able to predict zero tip for payment types 2, 3, 4, and 5, but is this what we really want? The underlying true tip should not be zero; that's just how the data was recorded.
A common saying in machine learning is "garbage in, garbage out". Often, feature selection and data preprocessing are more important than your model architecture.
First question:
Yes
Second question:
Since payment types 2, 3, 4, and 5 all result in a tip of 0, why not keep it simple: replace all payment types that are not 1 with 0. This lets your model easily correlate 1 with a recorded tip and 0 with no recorded tip, and it also reduces the amount your model has to learn.
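A one-line sketch of that replacement, using the raw dataframe from the question:

# 1 stays 1 (tips recorded); every other payment type becomes 0
raw['payment_type'] = (raw['payment_type'] == 1).astype(int)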
Third question:
If the "underlying true tip" is not reflected in the data, then it is simply impossible for your model to learn it. Whether this inaccurate representation of the truth is acceptable is a decision for you to make. Ideally you would have data that shows the actual tip.
Preprocessing your data is very important and will help your model tremendously. Besides adjusting the payment_type feature, you should also look into normalizing your data, which will help your machine learning algorithm generalize the relations in your data better.
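For instance, a hedged sketch of standardizing some of the numeric columns with scikit-learn (column names are taken from the question's sample; which columns to scale is a modeling choice):

from sklearn.preprocessing import StandardScaler

numeric_cols = ['trip_distance', 'fare_amount', 'tolls_amount', 'passenger_count']
scaler = StandardScaler()
# in practice, fit the scaler on the training split only and reuse it on the test split
raw[numeric_cols] = scaler.fit_transform(raw[numeric_cols])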
I'm working on a model which will predict a number from others' opinions. For this I will use LinearRegression from scikit-learn.
For example, I have 5 agents from which I collect data over time; each row holds each agent's last change at that iteration. If an agent hasn't inserted a value yet, the data contains NaN until their first change. The data looks something like this:
a1 a2 a3 a4 a5 target
1 nan nan nan nan 3 4.5
2 4 nan nan nan 3 4.5
3 4 5 nan nan 3 4.5
4 4 5 5 nan 3 4.5
5 4 5 5 4 3 4.5
6 5 5 5 4 3 4.5
So at each iteration/change I want to predict the final number. As we know, linear regression doesn't allow a_n = NaN in the data, so I replace them with a_n = 0, which shouldn't ruin the answer, because the formula of linear regression is: result = a1*w1 + a2*w2 + ... + an*wn + c.
Questions I have at the moment:
Does my solution somehow affect the fit? Is there a better solution for my problem? Should I train my model only on complete data and then use it with my current solution?
Setting NaNs to 0 and training a linear regression to find coefficients for each of the variables is fine, depending on the use case.
Why?
You are essentially training the model and telling it that, for many rows, variables a1, a2, etc. contribute nothing to the prediction (whenever the value is NaN and set to 0).
If the NaNs are there because the data has not been filled in yet, then setting them to 0 and training your model is wrong. It's better to train your model after all the data has been entered (at least for all the agents who have entered some data); this model can later be used to predict for new agents. Otherwise, your coefficients will be overfit to the 0s (NaNs) if many agents have not yet entered their data.
Given that the end target is a continuous variable, linear regression is a good approach to go with.
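For reference, a minimal sketch of the NaN-to-0 approach discussed above, assuming df holds the table from the question:

from sklearn.linear_model import LinearRegression

X = df[['a1', 'a2', 'a3', 'a4', 'a5']].fillna(0)   # missing opinions contribute nothing
y = df['target']

model = LinearRegression()
model.fit(X, y)
print(model.predict(X.tail(1)))                    # predict from the latest set of opinions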