ARIMA forecasts constrained to an interval in Python language - python

I'm trying to copy the second exercise ("Forecasts constrained to an interval") in the link below:
https://otexts.com/fpp2/limits.html
What the link does is an ARIMA with forecasts constrained to an interval using a certain logarithmic transformation and then back-transformation at the end. But the example in the link uses R language, and I can't find a similar example for Python no matter how much I search.
Can anyone tell me how I can do the exact same thing described in the link with Python? I'm certain it is possible using the statsmodels library, but I'm not sure how to exactly replicate the transformation constraints.
The standard ARIMA in Python:
from statsmodels.tsa.arima_model import ARIMA
import numpy as np
model = ARIMA(series, order=(0,1,1))
model_fit = model.fit(trend='nc',full_output=True, disp=1)
print(model_fit.summary())
I have a feeling that I need to add something like this somewhere (transformation formula):
series = np.log((series-a)/(b-series))
as well as the back-transformation formula. But since they don't produce explicit errors I can't be sure whether I'm coding it right.
Also, I'm stuck at where I should be adding the transformation and back-transformation. I would appreciate it if someone could explain how the exercise in the link could be replicated in Python.
P.S. By 'transformation' here, it has nothing to do with making the time series stationary. I didn't mention the stationary part because it's unrelated to my current question. The link above uses the word 'transformation' to use the logarithmic formula to make the time series constrained to lie between 'a' and 'b'.
What I tried so far:
series = np.log((series-a)/(b-series))
model = ARIMA(series, order=(0,1,1))
model_fit = model.fit(trend='c',full_output=True, disp=1)
print(model_fit.summary())
fore = model_fit.forecast(steps=1)
fore = (b-a)*np.exp(fore)/(1+np.exp(fore)) + a

The link you referred to in the question makes it clear that the transformation is applied before fitting and forecasting, and reversed afterwards. So:
transform your data
fit the ARIMA model and forecast on the transformed data
reverse the transformation on the predicted values
a = 50
b = 400
# Transformation on the data
train = np.log((series-a)/(b-series))
# Choose suitable order
model = ARIMA(train,order=(2,2,2))
results = model.fit()
start = len(train)
# One-step-ahead forecast: end equals start. Increase end to forecast further ahead.
predictions = results.predict(start=start, end=start, dynamic=False, typ='levels')
# reverse transformation
predictions = ((b-a)*np.exp(predictions)/(1+np.exp(predictions))) + a
Passing dynamic=False means that forecasts at each point are generated using the full history up to that point (all lagged values).
Passing typ='levels' predicts the levels of the original endogenous variables. If we'd used the
default typ='linear' we would have seen linear predictions in terms of the differenced
endogenous variables.
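As a minimal sketch of the two formulas used above, the scaled logit transform and its inverse can be wrapped in two helper functions. The names `to_unbounded` and `to_bounded` are hypothetical (they are not part of statsmodels), and the sample values are made up for illustration.

```python
import numpy as np

def to_unbounded(y, a, b):
    """Scaled logit: map values in the open interval (a, b) onto the real line."""
    y = np.asarray(y, dtype=float)
    return np.log((y - a) / (b - y))

def to_bounded(x, a, b):
    """Inverse of to_unbounded: map real values back into (a, b)."""
    x = np.asarray(x, dtype=float)
    # Algebraically the same as (b - a) * exp(x) / (1 + exp(x)) + a
    return a + (b - a) / (1.0 + np.exp(-x))

a, b = 50, 400
series = np.array([60.0, 120.0, 250.0, 390.0])  # made-up data inside (a, b)

transformed = to_unbounded(series, a, b)   # fit and forecast on this scale
restored = to_bounded(transformed, a, b)   # round trip back to the original scale

# Forecasts made on the transformed scale land inside (a, b) after back-transformation
extreme_forecasts = to_bounded(np.array([-20.0, 0.0, 20.0]), a, b)
```

The model is fitted to `transformed`, and only the forecasts are passed through `to_bounded`. Every value in the series has to lie strictly between `a` and `b`, otherwise the logarithm is undefined.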

Related

H2O DistributedRandomForest all tree predictions

I use Python's H2O (version 3.22.1.3), and I was wondering if it is possible to observe each tree's predictions in the Random Forest, like we can with scikit-learn's RandomForestRegressor.estimators_ attribute. I tried to use h2o.predict_leaf_node_assignment(), but it returns either the prediction path for each tree or (supposedly) the id of the leaf node on which the prediction was based. In the latest version, H2O added the Tree class, but unfortunately it does not have a predict() method. Although I can access any node in any of the random forest's trees, my implementation of a tree predict function using the recently added tree API (even if it is correct) is extremely slow. So, my question is:
(a) Can I obtain tree predictions natively, and if yes, then how?
(b) If no, do the H2O developers plan to implement this feature in future releases?
Any response would be greatly appreciated.
UPDATE: Thank you, Joe, for your response. As for now (before the feature is directly implemented), here is the only workaround I could think of which generates tree predictions.
# Suppose we have random forest model called drf with ntrees=70 and want to make predictions on df_valid
# After executing the code below, we get a dataframe tree_predictions with ntrees (in our case 70) columns, where i-th column corresponds to the predictions of i-th tree, and the same number of rows as df_valid.
# Extract the trees to create prediction intervals
# Number of trees
ntrees = 70
from h2o.tree import H2OTree
# Extract all the tree of drf, create the list of prediction trees
list_of_trees = [H2OTree(model = drf, tree_number = t, tree_class = None) for t in range(ntrees)]
# leaf_nodes contains the node_id's of tree leaves with predictions
leaf_nodes = drf.predict_leaf_node_assignment(df_valid, type='Node_ID').as_data_frame()
# tree_predictions is the dataframe with predictions for all the 70 trees
tree_predictions = pd.DataFrame(columns=['T'+str(t+1) for t in range(ntrees)])
for t in range(ntrees):
    tr = list_of_trees[t]
    node_ids = np.array(tr.node_ids)
    treePred = lambda n: tr.predictions[np.where(node_ids==n)[0][0]]
    tree_predictions['T'+str(t+1)] = leaf_nodes['T'+str(t+1)].apply(treePred)
Right now the answer is no. We've created an issue for implementing a new feature in the Tree API. You can track the progress here: https://0xdata.atlassian.net/browse/PUBDEV-6322.
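Until that feature exists, the per-row np.where lookup in the workaround above is one likely source of slowness. A possible alternative, sketched here under assumptions, is to build a node-id-to-prediction dictionary once per tree and map a whole column through it. The lists below are made-up stand-ins for one tree's `tr.node_ids` and `tr.predictions` and for one column of `leaf_nodes`; no H2O calls are made.

```python
import pandas as pd

# Made-up stand-ins for tr.node_ids and tr.predictions of a single tree
node_ids = [0, 1, 2, 3, 4]
predictions = [1.2, 0.9, 0.7, 1.5, 2.1]

# Made-up stand-in for one column of leaf_nodes, e.g. leaf_nodes['T1']
leaf_ids = pd.Series([2, 4, 3, 2])

# Build the lookup once, then map the whole column in a single call
lookup = dict(zip(node_ids, predictions))
tree_column = leaf_ids.map(lookup)
```

Whether this is noticeably faster on a real model depends on the number of rows and trees, so it is worth timing both versions.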

How to restore the original feature names in XGBoost feature importance plot (after preprocessing removed them)?

Preprocessing the training data (such as centering or scaling) before training an XGBoost model, can lead to a loss of feature names. Most answers on SO suggest training the model in such a way that feature names aren't lost (such as using pd.get_dummies on data frame columns).
I have trained an XGBoost model using the preprocessed data (centre and scale using MinMaxScaler). Thereby, I am in a similar situation where feature names are lost.
For instance:
scaler = MinMaxScaler(feature_range=(0, 1))
X = scaler.fit_transform(X)
my_model_name = XGBClassifier()
my_model_name.fit(X,Y)
where X and Y are the training data and labels respectively. The scaling above returns a 2D NumPy array, thereby discarding feature names from pandas DataFrame.
Thus, when I try to use plot_importance(my_model_name), it leads to the plot of feature importance, but only with feature names such as f0, f1, f2 etc., and not the actual feature names from the original data set.
Is there a way to map the feature names from the original training data to the feature importance plot generated, so that the original feature names are plotted in the graph? Any help in this regard is highly appreciated.
You can get the feature names by:
model.get_booster().feature_names
You are right that when you pass a NumPy array to the fit method of XGBoost, you lose the feature names. In that case calling model.get_booster().feature_names is not useful, because the returned names have the form [f0, f1, ..., fn], and these are the names shown in the output of plot_importance as well.
But there are several ways to achieve what you want, provided you stored your original feature names somewhere, e.g. orig_feature_names = ['f1_name', 'f2_name', ..., 'fn_name'], or directly orig_feature_names = X.columns if X was a pandas DataFrame.
Then you should be able to:
change the stored feature names (model.get_booster().feature_names = orig_feature_names) and then call plot_importance, which should pick up the updated names and show them on the plot
or, since this method returns a matplotlib ax, modify the labels using plot_importance(model).set_yticklabels(orig_feature_names) (but you have to put your features in the correct order)
or take model.feature_importances_ and combine it with your original feature names yourself (i.e. plot it yourself)
similarly, use the model.get_booster().get_score() method and combine it with your feature names
or try the Learning API with an xgboost DMatrix and specify your feature names when creating the dataset (after scaling) with train_data = xgb.DMatrix(X, label=Y, feature_names=orig_feature_names) (but I do not have much experience with this way of training, since I usually use the Scikit-Learn API)
EDIT:
Thanks to @Noob Programmer (see comments below): there might be some "inconsistencies" depending on which feature importance method you use. These are the most important ones:
xgboost.plot_importance uses "weight" as the default importance type (see plot_importance)
model.get_booster().get_score() also uses "weight" as the default (see get_score)
model.feature_importances_ depends on the importance_type parameter (model.importance_type), and it seems that the result is normalized to sum to 1 (see this comment)
For more info on this topic, look at How to get feature importance.
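As a small sketch of the "combine it with your feature names" option, the dictionary returned by get_score() can be re-keyed with the original names. The helper `rename_scores` is hypothetical, and the `scores` dictionary is a made-up stand-in for the output of model.get_booster().get_score(); as far as I know, that output leaves out features that were never used in a split, which is why the mapping goes through the index in the key rather than through position.

```python
def rename_scores(scores, orig_feature_names):
    """Map default keys such as 'f0' or 'f2' to the original column names."""
    return {
        orig_feature_names[int(key[1:])]: value
        for key, value in scores.items()
    }

# Made-up stand-in for model.get_booster().get_score()
scores = {"f0": 12, "f2": 5}
orig_feature_names = ["age", "income", "height"]

named = rename_scores(scores, orig_feature_names)

# Sort from most to least important, ready for plotting or printing
ranked = sorted(named.items(), key=lambda item: item[1], reverse=True)
```

The `ranked` list can then be passed to any plotting function, for example a horizontal bar chart.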
I tried the above answers, but they didn't work when loading the model after training.
So, the working code for me is :
model.feature_names
it returns a list of the feature names
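I think it is best to turn the NumPy array back into a pandas DataFrame, e.g.:
import matplotlib.pyplot as plt
import pandas as pd
import xgboost as xgb
from sklearn.preprocessing import MinMaxScaler
from xgboost import XGBClassifier
Y = label  # label: your target vector, defined elsewhere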
X_df = pd.read_csv("train.csv")
orig_feature_names = list(X_df.columns)
scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled_np = scaler.fit_transform(X_df)
X_scaled_df = pd.DataFrame(X_scaled_np, columns=orig_feature_names)
my_model_name = XGBClassifier(max_depth=2, n_estimators=2)
my_model_name.fit(X_scaled_df,Y)
xgb.plot_importance(my_model_name)
plt.show()
This will show the original names.

formatting design matrix for regression

I am given a test set without the response variable. I have already built the model and need to predict the response variable in the testing set.
I am having trouble formatting the test design matrix so that it would be compatible.
I am using patsy library to construct the matrix.
I want to do something like this, except the code below does not work:
X = dmatrices('Response ~ var1 + var2', test, return_type = 'dataframe')
What is the right approach? thanks
If you used patsy to fit the model in the first place, then you should tell it "hey, you know how you built my first design matrix? build me another the same way":
# Set up training data
train_Y, train_X = dmatrices("Response ~ ...", train, return_type="dataframe")
# Save patsy's record of how it built this matrix:
design_info = train_X.design_info
# Re-use it to build the test matrix
test_X = dmatrix(design_info, test, return_type="dataframe")
Alternatively, you could build a new matrix from scratch:
# Use 'dmatrix' and leave out the left-hand-side of the formula
test_X = dmatrix("~ ...", test, return_type="dataframe")
The first approach is better if you can do it. For example, suppose you have a categorical variable that you're letting patsy encode for you. And suppose that there are 10 categories that show up in your training set, but only 5 of them occur in your test set. If you use the first approach, then patsy will remember what the 10 categories were, and generate a test matrix with 10 columns (some of them all-zeros). If you use the second approach, then patsy will generate a training matrix with 10 columns and a test matrix with 5 columns, and then your model code is probably going to crash because the matrix isn't the shape it expects.
Another case where this matters is if you use patsy's center function to center a variable: with the first approach it will automatically remember what value it subtracted off from the training data and re-use it for the test data, which is what you want. With the second approach it will recompute the center using the test data, which can lead to you silently getting really really wrong results.
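To make the difference concrete, here is a minimal sketch with made-up data in which the test set contains fewer categories than the training set. It uses build_design_matrices to reuse the saved design_info; the column name `city` and the values are illustrative only.

```python
import pandas as pd
from patsy import dmatrix, build_design_matrices

# Made-up data: three cities in training, only two of them in the test set
train = pd.DataFrame({"city": ["Delhi", "Kolkata", "Mumbai", "Delhi"]})
test = pd.DataFrame({"city": ["Mumbai", "Delhi"]})

train_X = dmatrix("~ city", train, return_type="dataframe")

# First approach: reuse patsy's record of how the training matrix was built
test_X_reused = build_design_matrices(
    [train_X.design_info], test, return_type="dataframe"
)[0]

# Second approach: rebuild from the formula, looking only at the test data
test_X_rebuilt = dmatrix("~ city", test, return_type="dataframe")

print(list(train_X.columns))
print(list(test_X_reused.columns))
print(list(test_X_rebuilt.columns))
```

The reused matrix keeps the training columns, with zeros for the city that is absent from the test data, while the rebuilt matrix has fewer columns.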

Patsy: New levels in categorical fields in test data

I am trying to use Patsy (with sklearn, pandas) for creating a simple regression model. The R style formula creation is a major draw.
My data contains a field called 'ship_city' which can have any city from India. Since I am partitioning the data into train and test sets, there are several cities which appear only in one of the sets. A code snippet is given below:
df_train_Y, df_train_X = dmatrices(formula, data=df_train, return_type='dataframe')
df_train_Y_design_info, df_train_X_design_info = df_train_Y.design_info, df_train_X.design_info
df_test_Y, df_test_X = build_design_matrices([df_train_Y_design_info.builder, df_train_X_design_info.builder], df_test, return_type='dataframe')
The last line throws the following error:
patsy.PatsyError: Error converting data to categorical: observation
with value 'Kolkata' does not match any of the expected levels
I believe this is a very common use case where training data will not have all levels of all categorical fields. Sklearn's DictVectorizer handles this quite well.
Is there any way I can make this work with Patsy?
The problem of course is that if you just give patsy a raw list of values, it has no way to know that there are other values that could potentially happen as well. You have to somehow tell it what the complete set of possible values is.
One way is by using the levels= argument to C(...), like:
# If you have a data frame with all the data before splitting:
all_cities = sorted(df_all["Cities"].unique())
# Alternative approach:
all_cities = sorted(set(df_train["Cities"]).union(set(df_test["Cities"])))
dmatrices("y ~ C(Cities, levels=all_cities)", data=df_train)
Another option if you're using pandas's default categorical support is to record the set of possible values when you set up your data frame; if patsy detects that the object you've passed it is a pandas categorical then it automatically uses the pandas categories attribute instead of trying to guess what the possible categories are by looking at the data.
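A minimal sketch of the pandas categorical option, using made-up data: the full set of levels is declared before splitting, so the training and test matrices end up with the same columns even though 'Kolkata' only appears in the test rows.

```python
import pandas as pd
from patsy import dmatrix

# Made-up data frame holding all observations before the split
df_all = pd.DataFrame({"Cities": ["Delhi", "Mumbai", "Delhi", "Kolkata"]})

# Record the complete set of levels on the column itself
all_cities = sorted(df_all["Cities"].unique())
df_all["Cities"] = pd.Categorical(df_all["Cities"], categories=all_cities)

df_train = df_all.iloc[:3]  # 'Kolkata' never appears here
df_test = df_all.iloc[3:]   # 'Kolkata' only appears here

train_X = dmatrix("~ Cities", df_train, return_type="dataframe")
test_X = dmatrix("~ Cities", df_test, return_type="dataframe")

print(list(train_X.columns))
print(list(test_X.columns))
```

Slicing a data frame keeps the categorical dtype and its categories, which is what lets patsy see levels that are missing from a particular subset.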
I ran into a similar problem and I built the design matrices prior to splitting the data.
df_Y, df_X = dmatrices(formula, data=df, return_type='dataframe')
df_train_X, df_test_X, df_train_Y, df_test_Y = \
train_test_split(df_X, df_Y, test_size=test_size)
Then as an example of applying a fit:
model = smf.OLS(df_train_Y, df_train_X)
model2 = model.fit()
predicted = model2.predict(df_test_X)
Technically I haven't built a test case, but I haven't run into the Error converting data to categorical error again since implementing the above.

ml-py svm converges but classifying wrongly

I am trying to do some classification task with python and SVM.
From the collected data I extracted the feature vectors for each class and created a training set. The feature vectors are n-dimensional (39 or more). So, say for 2 classes, I have a set of 39-d feature vectors and a single array of class labels corresponding to each entry in the feature vectors. Currently, I am using mlpy and doing something like this:
import numpy as np
import mlpy
svm=mlpy.Svm('gaussian') #tried a linear kernel too, but it did not converge
instance= np.vstack((featurevector1,featurevector2))
label=np.hstack((np.ones((1,len(featurevector1)),dtype=int),-1*np.ones((1,len(featurevector2)),dtype=int)))
#Assigning a label(+1/-1) for each entry in instance, (+1 for entries coming from
#featurevector 1 and -1 for featurevector2
svm.compute(instance,label) #it converges and outputs 1
svm.predict(testdata) #This says all class labels are 1, whereas I have test data from both classes
Am I making a mistake here? Or should I use some other library? Please help.
I don't use mlpy, but np.ones((1,len(featurevector1))) should perhaps be just np.ones(len(featurevector1)).
Print the .shape of each to see the difference.
(If you have a link to public data anything like yours, could you post it please?)
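To illustrate the shape difference, here is a short sketch with made-up sizes standing in for len(featurevector1) and len(featurevector2). Whether mlpy requires the one-dimensional form is an assumption on my part.

```python
import numpy as np

# Hypothetical sizes standing in for len(featurevector1) and len(featurevector2)
n1, n2 = 3, 2

# Labels built from (1, n) arrays stay two-dimensional after hstack
labels_2d = np.hstack((np.ones((1, n1), dtype=int), -1 * np.ones((1, n2), dtype=int)))

# Labels built from flat arrays give one label per row of the training set
labels_1d = np.hstack((np.ones(n1, dtype=int), -1 * np.ones(n2, dtype=int)))

print(labels_2d.shape)  # (1, 5)
print(labels_1d.shape)  # (5,)
```

The flat version has the same length as the number of rows in the stacked training matrix, which is the layout most classifiers expect for a label array.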
