I'm working on a model that will predict a number from others' opinions. For this I will use LinearRegression from sklearn.
For example, I have 5 agents from which I collect data over time, recording their latest change in each iteration. If an agent hasn't entered a value yet, the data contains NaN until their first change. The data looks something like this:
a1 a2 a3 a4 a5 target
1 nan nan nan nan 3 4.5
2 4 nan nan nan 3 4.5
3 4 5 nan nan 3 4.5
4 4 5 5 nan 3 4.5
5 4 5 5 4 3 4.5
6 5 5 5 4 3 4.5
So in each iteration/change I want to predict the end number. As we know, linear regression doesn't allow a_n = NaN in the data, so I replace them with a_n = 0, which shouldn't ruin the answer, because the linear regression formula is: result = a1*w1 + a2*w2 + ... + an*wn + c.
Current questions I have at the moment:
Does my solution somehow affect the fit? Is there a better solution for my problem? Should I train my model only on complete data and then use it with my current solution?
Setting NaNs to 0 and training a linear regression to find coefficients for each of the variables is fine, depending on the use case.
Why?
You are essentially training the model and telling it that, for many rows, the contribution of variable a1, a2, etc. is zero (whenever the value is NaN and has been set to 0).
If the NaNs are there because the data has not been filled in yet, then setting them to 0 and training your model is wrong. It's better to train your model after all the data has been entered (at least for all the agents who have entered some data). That model can later be used to predict for new agents. Otherwise, your coefficients will be overfit to the 0s (NaNs) if many agents have not yet entered their data.
Since the end target is a continuous variable, linear regression is a reasonable approach.
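A minimal sketch of that workflow, assuming X is the (n_samples x 5) NumPy array of agent values (containing NaNs), y is the target, and X_new is a 2-D array of new rows to predict for; all of these names are placeholders:
import numpy as np
from sklearn.linear_model import LinearRegression

# Train only on rows where every agent has already entered a value.
complete_rows = ~np.isnan(X).any(axis=1)
model = LinearRegression().fit(X[complete_rows], y[complete_rows])

# At prediction time, agents that have not answered yet are zero-filled,
# which simply removes their a_i * w_i term from the linear formula.
prediction = model.predict(np.nan_to_num(X_new))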
I'm currently exploring the use of Random Forests to predict future values of occurrences (my ARIMA model gave me really bad forecasts, so I'm trying to evaluate other options). I'm fully aware that the bad results might be because I don't have a lot of data and its quality isn't the greatest. My initial data consisted simply of the number of occurrences per date. I then added separate columns representing the day, month, year and day of the week (which was later one-hot encoded), and I also added two columns with lagged values (one with the value observed the day before and another with the value observed two days before). The final data looks like this:
Count Year Month Day Count-1 Count-2 Friday Monday Saturday Sunday Thursday Tuesday Wednesday
196.0 2017.0 7.0 10.0 196.0 196.0 0 1 0 0 0 0 0
264.0 2017.0 7.0 11.0 196.0 196.0 0 0 0 0 0 1 0
274.0 2017.0 7.0 12.0 264.0 196.0 0 0 0 0 0 0 1
286.0 2017.0 7.0 13.0 274.0 264.0 0 0 0 0 1 0 0
502.0 2017.0 7.0 14.0 286.0 274.0 1 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ...
I then trained a random forest, making the count the label (what I'm trying to predict) and all the rest the features. I also made a 70/30 train/test split, trained the model on the train data, and then used the test set to evaluate it (code below):
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=1000, random_state=42)
rf.fit(train_features, train_labels)
predictions = rf.predict(test_features)
The results I obtained were pretty good: MAE=1.71 and Accuracy of 89.84%.
First question: is there any possibility that I'm crazily overfitting the data? I just want to make sure I'm not making some big mistake that's giving me better results than I should get.
Second question: with the model trained, how do I use the RF to predict future values? My goal is to give weekly forecasts for the number of occurrences, but I'm kind of stuck on how to do that.
If someone who's a bit better and more experienced than me at this could help, I'd very much appreciate it! Thanks
Addressing your first question: random forest can tend to overfit, but that should be checked by comparing the MAE, MSE and RMSE of your test set with those of your training set. What do you mean by accuracy, your R squared? In any case, the usual way to work with these models is to first let them overfit, so you have a decent accuracy/MSE/RMSE, and later apply regularization techniques to deal with the overfitting, for example by setting a high min_samples_leaf (the scikit-learn analogue of min_child_weight) or a low max_depth; a high n_estimators is also good.
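For example, a quick way to check this, reusing the variable names from the question (test_labels being the held-out target values):
from sklearn.metrics import mean_absolute_error

# A much lower training error than test error is the usual sign of overfitting.
train_mae = mean_absolute_error(train_labels, rf.predict(train_features))
test_mae = mean_absolute_error(test_labels, rf.predict(test_features))
print(train_mae, test_mae)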
Secondly, to predict future values you use the exact same model you trained, applied to the dataset you want to forecast on. Of course, the features given at training time must match the inputs provided when forecasting. Furthermore, keep in mind that as time passes, the new observations become very valuable for improving your model: add them to your training dataset and retrain.
forecasting = rf.predict(dataset_to_be_forecasted)
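Since Count-1 and Count-2 are lags of the target, forecasting more than one day ahead usually means feeding each prediction back in as the next day's lag. A minimal sketch of such a recursive weekly forecast, assuming last_date is the last observed date as a pandas Timestamp, last_counts holds the two most recent observed counts (newest last), and feature_columns is the exact column order used for training; all of these names are placeholders:
import pandas as pd

def weekly_forecast(rf, last_date, last_counts, feature_columns):
    """Recursively predict the next 7 daily counts (a sketch)."""
    day_names = ['Friday', 'Monday', 'Saturday', 'Sunday',
                 'Thursday', 'Tuesday', 'Wednesday']
    counts = list(last_counts)
    predictions = []
    for step in range(1, 8):
        date = last_date + pd.Timedelta(days=step)
        row = {'Year': date.year, 'Month': date.month, 'Day': date.day,
               'Count-1': counts[-1], 'Count-2': counts[-2]}
        for day in day_names:
            row[day] = 1 if date.day_name() == day else 0
        x = pd.DataFrame([row])[feature_columns]
        y_hat = rf.predict(x)[0]
        predictions.append(y_hat)
        counts.append(y_hat)  # the prediction becomes the next day's lag
    return predictions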
I am working on an interesting machine learning project about the NYC taxi data (https://s3.amazonaws.com/nyc-tlc/trip+data/green_tripdata_2017-04.csv). The target is predicting the tip amount. The raw data looks like this (2 data samples):
VendorID lpep_pickup_datetime lpep_dropoff_datetime store_and_fwd_flag \
0 2 2017-04-01 00:03:54 2017-04-01 00:20:51 N
1 2 2017-04-01 00:00:29 2017-04-01 00:02:44 N
RatecodeID PULocationID DOLocationID passenger_count trip_distance \
0 1 25 14 1 5.29
1 1 263 75 1 0.76
fare_amount extra mta_tax tip_amount tolls_amount ehail_fee \
0 18.5 0.5 0.5 1.00 0.0 NaN
1 4.5 0.5 0.5 1.45 0.0 NaN
improvement_surcharge total_amount payment_type trip_type
0 0.3 20.80 1 1.0
1 0.3 7.25 1 1.0
There are five different values of 'payment_type', indicated by the numbers 1, 2, 3, 4, 5.
I find that only when 'payment_type' is 1 is 'tip_amount' meaningful; payment types 2, 3, 4 and 5 all have zero tip:
for i in range(1, 6):
    print(raw[raw["payment_type"] == i][['tip_amount', 'payment_type']].head(2))
gives:
tip_amount payment_type
0 1.00 1
1 1.45 1
tip_amount payment_type
5 0.0 2
8 0.0 2
tip_amount payment_type
100 0.0 3
513 0.0 3
tip_amount payment_type
59 0.0 4
102 0.0 4
tip_amount payment_type
46656 0.0 5
53090 0.0 5
First question: I want to build a regression model for 'tip_amount'. If I use 'payment_type' as a feature, can the model automatically handle this kind of behavior?
Second question: we know that 'tip_amount' is actually not zero for payment types 2, 3, 4 and 5; it just isn't correctly recorded. If I drop these samples and only keep 'payment_type' == 1, then when the model is used on an unseen test dataset it cannot predict zero tips for payment types 2, 3, 4 and 5, so I have to keep 'payment_type' as an important feature, right?
Third question: let's say I keep all the different 'payment_type' samples and the model is able to predict a zero tip amount for payment types 2, 3, 4 and 5, but is this what we really want? The underlying true tip should not be zero; that's just how the data looks.
A common saying in machine learning goes: garbage in, garbage out. Often, feature selection and data preprocessing are more important than your model architecture.
First question:
Yes
Second question:
Since payment_type values of 2, 3, 4 and 5 all result in 0, why not keep it simple: replace all payment types that are not 1 with 0. This lets your model easily correlate 1 with being paid and 0 with not being paid. It also reduces the amount of things your model has to learn in the future.
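A sketch of that replacement, assuming the dataframe is still called raw as in the question:
# 1 = a tip can be recorded (original payment_type 1), 0 = everything else (types 2-5).
raw['payment_type'] = (raw['payment_type'] == 1).astype(int)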
Third question:
If the "underlying true tip" is not reflected in the data, then it is simply impossible for your model to learn it. Whether this inaccurate representation of the truth is what we want or not what we want is a decision for you to make. Ideally you would have data that shows the actual tip.
Preprocessing your data is very important and will help your model tremendously. Besides making some changes to your payment_type features, you should also look into normalizing your data, which will help your machine learning algorithm better generalize relations between your data.
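As a sketch of that normalization step, feature_cols here is a hypothetical list of the numeric feature columns you decide to keep:
from sklearn.preprocessing import StandardScaler

# Standardize the numeric features to zero mean and unit variance.
scaler = StandardScaler()
raw[feature_cols] = scaler.fit_transform(raw[feature_cols])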
Note: Contrived example. Please don't hate on forecasting and I don't need advice on it. This is strictly a Pandas how-to question.
Example - One Solution
I have two different sized DataFrames, one representing sales and one representing a forecast.
sales = pd.DataFrame({'sales':[5,3,5,6,4,4,5,6,7,5]})
forecast = pd.DataFrame({'forecast':[5,5.5,6,5]})
The forecast needs to line up with the latest sales, which are at the end of the list of sales numbers [5, 6, 7, 5]. Other times, I might want it at other locations (please don't ask why, I just need it this way).
This works:
df = pd.concat([sales, forecast], ignore_index=True, axis=1)
df.columns = ['sales', 'forecast'] # Not necessary, making next command pretty
df.forecast = df.forecast.shift(len(sales) - len(forecast))
This gives me the desired outcome:
Question
What I want to know is: Can I concatenate to the end of the sales data without performing the additional shift (the last command)? I'd like to do this in one step instead of two. concat or something similar is fine, but I'd like to skip the shift.
I'm not hung up on having two lines of code. That's okay. I want a solution with the maximum possible performance. My application is sensitive to every millisecond we throw at it on account of huge volumes.
Not sure if that is much faster but you could do
sales = pd.DataFrame({'sales':[5,3,5,6,4,4,5,6,7,5]})
forecast = pd.DataFrame({'forecast':[5,5.5,6,5]})
forecast.index = sales.index[-forecast.shape[0]:]
which gives
forecast
6 5.0
7 5.5
8 6.0
9 5.0
and then simply
pd.concat([sales, forecast], axis=1)
yielding the desired outcome:
sales forecast
0 5 NaN
1 3 NaN
2 5 NaN
3 6 NaN
4 4 NaN
5 4 NaN
6 5 5.0
7 6 5.5
8 7 6.0
9 5 5.0
A one-line solution using the same idea, as mentioned by @Dark in the comments, would be:
pd.concat([sales, forecast.set_axis(sales.index[-len(forecast):], inplace=False)], axis=1)
giving the same output.
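If every millisecond matters, a NumPy-level variant that builds the padded column directly may also be worth timing against the approaches above (a sketch using the sales and forecast frames from the question):
import numpy as np

# Pre-pad the forecast with NaNs so it lines up with the tail of the sales index.
pad = np.full(len(sales) - len(forecast), np.nan)
out = sales.copy()
out['forecast'] = np.concatenate([pad, forecast['forecast'].values])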
I have a predictive modelling problem. Hopefully someone has time and can help me. The starting position is shown below: s1-s3 are sensor measurements and RUL is my target value.
DataStructure:
id period s1 s2 s3 RUL
1 1 510.23 643.43 1585.29 6
1 2 512.34 644.89 1586.12 5
1 3 514.65 645.11 1587.99 4
1 4 512.98 647.59 1588.45 3
1 5 516.34 649.04 1590.65 2
1 6 518.12 652.62 1593.09 1
2 1 509.77 640.61 1584.91 9
2 2 510.26 642.06 1586.00 8
2 3 511.95 643.62 1588.09 7
2 4 513.51 646.51 1589.45 6
2 5 512.17 648.06 1589.54 5
2 6 515.56 646.11 1586.22 4
2 7 518.78 649.34 1586.96 3
2 8 519.90 650.30 1588.95 2
2 9 521.05 651.39 1591.34 1
3 1 501.11 653.99 1580.45 8
3 2 511.45 643.23 1584.09 7
3 3 505.45 643.78 1586.11 6
3 4 504.45 643.43 1588.34 5
3 5 506.45 643.71 1589.89 4
3 6 511.45 643.33 1591.21 3
3 7 516.45 643.61 1592.42 2
3 8 518.45 643.05 1596.77 1
Target:
My target is to predict the remaining useful life (RUL) of unseen data. In this case I have only one type of machine with different ids (that means one type and three different physical systems). For the prediction the id doesn't matter, because it's the same machine. Furthermore, I want to add new features: the moving averages of s1, s2 and s3. So I have to add three new columns with the names a1, a2 and a3.
For instance, a1 should look like:
a1
NaN
NaN
512.41
513.32
514.66
515.81
NaN
NaN
510.66
511.91
512.54
513.75
515.50
518.08
519.91
NaN
NaN
506.00
507.12
505.45
507.45
511.45
515.45
The next problem is that I can't work with NaN, because it's a string. How can I ignore or work with it for a1, a2 and a3?
Next question: how can I use regression models like random forest and bagged decision trees with a train_test_split to predict the RUL of unseen new data? (Of course I need more data; this example just gives the structure.) [s1], [s2], [s3] are my inputs and RUL is the output.
Furthermore, I want to evaluate the model with mean absolute error, mean squared error and R².
Finally, I want to use the grid search method for tuning.
Thank you:
Thank you in advance. I know what I want to do, but I'm not able to realize it in Python. A complete code example would be perfect.
The standard way of solving this problem is through imputation. Scikit-learn has a built-in class for imputation. The documentation is here: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html
There are 3 strategies for replacing the NaNs:
1) replacing them with the mean of the column
2) replacing them with the most frequent value (mode) of the column
3) replacing them with the median of the column
An example of usage would look like this:
from sklearn.preprocessing import Imputer

imp = Imputer(strategy='mean', axis=0)  # axis=0 imputes with the mean of each column
a1 = imp.fit_transform(a1)              # a1 must be 2-D, e.g. df[['a1']]
There are also usage examples available here: http://scikit-learn.org/stable/modules/preprocessing.html#imputation
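The imputation above covers the NaN handling. For the rest of the question (building the moving-average features, the train/test split, the error metrics and the grid search), here is a rough sketch; it assumes the data sits in a DataFrame called df with the columns shown in the question, and it uses a 3-period moving average per id, which matches the a1 values listed above:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Moving-average features per machine id (a 3-period window reproduces the
# a1 values shown in the question); the first two rows of each id stay NaN.
for i, col in enumerate(['s1', 's2', 's3'], start=1):
    df['a%d' % i] = df.groupby('id')[col].transform(lambda s: s.rolling(3).mean())

# Rows whose moving averages are still NaN can either be imputed as shown
# above or simply dropped before training.
data = df.dropna(subset=['a1', 'a2', 'a3'])

features = ['s1', 's2', 's3', 'a1', 'a2', 'a3']
X_train, X_test, y_train, y_test = train_test_split(
    data[features], data['RUL'], test_size=0.3, random_state=42)

# Grid search over a couple of random forest hyperparameters (example values only).
rf = RandomForestRegressor(random_state=42)
grid = GridSearchCV(rf,
                    {'n_estimators': [100, 500], 'max_depth': [None, 5, 10]},
                    cv=3, scoring='neg_mean_absolute_error')
grid.fit(X_train, y_train)

# Evaluate the tuned model on the held-out data with MAE, MSE and R².
pred = grid.predict(X_test)
print(mean_absolute_error(y_test, pred),
      mean_squared_error(y_test, pred),
      r2_score(y_test, pred))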
I'm trying to find a way to iterate a linear regression over many, many columns, running up to Z3. Here is a snippet of the dataframe, called df1:
Time A1 A2 A3 B1 B2 B3
1 1.00 6.64 6.82 6.79 6.70 6.95 7.02
2 2.00 6.70 6.86 6.92 NaN NaN NaN
3 3.00 NaN NaN NaN 7.07 7.27 7.40
4 4.00 7.15 7.26 7.26 7.19 NaN NaN
5 5.00 NaN NaN NaN NaN 7.40 7.51
6 5.50 7.44 7.63 7.58 7.54 NaN NaN
7 6.00 7.62 7.86 7.71 NaN NaN NaN
This code returns the slope coefficient of a linear regression for ONE column only and concatenates the value to a numpy array called series. Here is what it looks like for extracting the slope of the first column:
from sklearn.linear_model import LinearRegression

series = np.array([])  # blank array to collect the results
df2 = df1[~np.isnan(df1['A1'])]  # removes NaN values for this column so the sklearn fit works
df3 = df2[['Time', 'A1']]
npMatrix = np.matrix(df3)
X, Y = npMatrix[:, 0], npMatrix[:, 1]
slope = LinearRegression().fit(X, Y)  # either this or the next line
m = slope.coef_[0]
series = np.concatenate((series, m), axis=0)
As it stands now, I am using this slice of code and replacing "A1" with a new column name all the way up to "Z3", which is extremely inefficient. I know there are many easy ways to do this with some modules, but I have the drawback of all these intermediate NaN values in the time series, so it seems like I'm limited to this method or something like it.
I tried using a for loop such as:
for col in df1.columns:
and replacing 'A1' with col in the code, for example, but this does not seem to work.
Is there any way I can do this more efficiently?
Thank you!
One liner (or three)
time = df[['Time']]
pd.DataFrame(np.linalg.pinv(time.T.dot(time)).dot(time.T).dot(df.fillna(0)),
['Slope'], df.columns)
Broken down with a bit of explanation
Using the closed form of OLS, the vector of slopes is β = (XᵀX)⁻¹XᵀY.
In this case X is time, where we define time as df[['Time']]. I used the double brackets to preserve the DataFrame and its two dimensions; with single brackets I'd have gotten a Series and its one dimension, and then the dot products aren't as pretty.
(XᵀX)⁻¹Xᵀ is np.linalg.pinv(time.T.dot(time)).dot(time.T).
Y is df.fillna(0). Yes, we could have done one column at a time, but why, when we can do it all together? You have to deal with the NaNs somehow. How would you imagine dealing with them? Only fitting over the times where you had data? That is equivalent to placing zeroes in the NaN spots, so that's what I did.
Finally, I use pd.DataFrame(stuff, ['Slope'], df.columns) to contain all slopes in one place with the original labels.
Note that I calculated the slope of the regression for Time against itself. Why not? It was there. Its value is 1.0. Great! I probably did it right!
Looping is a decent strategy for a modest number of columns (say, fewer than thousands). Without seeing your implementation I can't say what's wrong, but here's my version, which works:
import numpy as np
from sklearn.linear_model import LinearRegression

slopes = []
for c in df1.columns:
    if c == "Time":
        continue  # skip the x-axis column itself
    mask = ~np.isnan(df1[c])
    x = np.atleast_2d(df1.Time[mask].values).T
    y = np.atleast_2d(df1[c][mask].values).T
    reg = LinearRegression().fit(x, y)
    slopes.append(reg.coef_[0])
I've simplified your code a bit to avoid creating so many temporary DataFrame objects, but it should work fine your way too.
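If you also want the slopes labelled by column, something like this (a sketch continuing from the loop above) collects them into a pandas Series:
import pandas as pd

# One slope per non-Time column, keyed by the column name.
slope_series = pd.Series(np.concatenate(slopes),
                         index=[c for c in df1.columns if c != 'Time'])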