Time Series AR model shows NaNs for prediction - python

I'm running the code below for an AR model and the predictions come back as NaN.
Can someone help me debug this?
# With Headers
df = pd.read_sql(sql_query, cnxn,index_col='date',parse_dates=True)
# index_col is required: statsmodels needs a date index, and the index frequency must be set
df.index.freq = 'MS'
# write the data to a CSV file for audits and testing
df.to_csv("Billings.csv")
#train test split
train_data = df.iloc[:len(df)-12]
test_data = df.iloc[len(df)-12:]
from statsmodels.tsa.ar_model import AR,ARResults
# Ignore harmless warnings
import warnings
warnings.filterwarnings("ignore")
model = AR(train_data['tcv'])
AR1fit = model.fit(maxlag=1,method='mle') # maxlag sets the number of lag coefficients, i.e. the model order (1 gives AR(1))
print(f'Lag: {AR1fit.k_ar}')
print(f'Coefficients:\n{AR1fit.params}')
# general format for obtaining predictions
start=len(train_data)
end=len(train_data)+len(test_data)-1
predictions1 = AR1fit.predict(start=start, end=end, dynamic=False).rename('AR(1) Predictions')
predictions1
Output:
Results of print statements

Thank you for uploading the results of the print statements!
As you can see, the value of the L1.tcv parameter is NaN. By the way, to get a better picture of the model fit, you can also do:
print(AR1fit.summary())
In any case, this explains why you get NaNs in your predictions - because any computation with NaN will result in NaN.
However, fixing this is another kettle of fish. If you look at the vignette here, you can see they use dropna in block [3].
I suspect that if you did something similar on your train set, train_data['tcv'].dropna(), this could fix your predictions.
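The check suggested above can be sketched with pandas alone. This is a minimal illustration, not the asker's data: the series below is made up to stand in for train_data['tcv'], and interpolate() is shown only as one possible alternative to dropna() when the regular monthly index needs to be kept.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly series standing in for train_data['tcv'].
idx = pd.date_range("2020-01-01", periods=6, freq="MS")
tcv = pd.Series([10.0, 12.0, np.nan, 11.0, 13.0, 12.5], index=idx, name="tcv")

# Count the missing values that would otherwise reach the model.
n_missing = int(tcv.isna().sum())

# Option 1: drop the gaps (this also removes those dates from the index).
dropped = tcv.dropna()

# Option 2: fill the gaps linearly so every month is still present.
filled = tcv.interpolate()

print(n_missing, len(dropped), int(filled.isna().sum()))
```

Whichever option is used, the model should be fitted on the cleaned series rather than on the original column.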

Related

How do I forecast data using ARIMA?

I wanted to forecast stock prices using the ARIMA model (Autoregressive Integrated Moving Average) and plot the forecast over the actual and training data. I'm following this tutorial and have browsed others too, but they all use the same code. Here is the link to the tutorial for reference: https://www.analyticsvidhya.com/blog/2021/07/stock-market-forecasting-using-time-series-analysis-with-arima-model/
# Forecast
fc, se, conf= fitted.forecast(216, alpha=0.05) # 95% conf
I was expecting a graph that looks like this
Instead, an error message shows up: ValueError: too many values to unpack (expected 3)
please help :')
Edit: I tried that before, and it produces an error in the code that follows. My next lines of code are:
result = fitted.forecast(216, alpha=0.05)
# Make as pandas series
fc_series = pd.Series(result, index=test_data.index)
lower_series = pd.Series(result[:, 0], index=test_data.index)
upper_series = pd.Series(result[:, 1], index=test_data.index)
The error message: KeyError: 'key of type tuple not found and not a MultiIndex'
It seems that the forecast function no longer returns three values. This can happen if you are not using the same statsmodels version as the tutorial.
Please try something like:
result = fitted.forecast(216, alpha=0.05)
Then inspect the result to see whether it contains all the data you need.
Import the library:
import statsmodels.api as sm
Build the model with sm.tsa:
model = sm.tsa.ARIMA(train_data, order=(1, 1, 1))
fitted = model.fit()
print(fitted.summary())
Call summary_frame on the forecast result to get the forecast together with the lower and upper interval bounds:
result = fitted.get_forecast(216).summary_frame(alpha=0.05)
print(result)
Build the pandas Series. Don't forget .values; otherwise the data is aligned against test_data.index and the Series can come out as NaN.
fc_series = pd.Series(result['mean'].values, index=test_data.index)
lower_series = pd.Series(result['mean_ci_lower'].values, index=test_data.index)
upper_series = pd.Series(result['mean_ci_upper'].values, index=test_data.index)
I hope this helps.

StandardScaler in Python

I want to standardize 'x_train'.
The first 'x_train' in the picture is the original data set, and the next 'x_train' below the previous one is standardized.
I just want to standardize the first six columns, so I wrote x_train[:,0:6] during standardization.
However, the result of the standardization is obviously unreasonable. Moreover, when I use the mean and standard deviation of 'x_train' to standardize x_test, the result is correct. It's weird. I have no idea what's wrong with my code.
Below is my code for standardizing.
Try -
scaler = preprocessing.StandardScaler().fit(x_train.iloc[:, 0:6])
#returning the scaled values to a new variable
X_train_first_six = scaler.transform(x_train.iloc[:, 0:6])
X_test_first_six = scaler.transform(x_test.iloc[:, 0:6])
ref. pandas iloc
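As a rough sketch of the same idea on NumPy arrays (the question indexes with x_train[:,0:6]), the scaled columns can be written back into copies so the remaining columns stay untouched. The random data and the names x_train_scaled and x_test_scaled are assumptions made for illustration only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data standing in for the real x_train / x_test (8 columns each).
rng = np.random.default_rng(0)
x_train = rng.normal(loc=5.0, scale=2.0, size=(50, 8))
x_test = rng.normal(loc=5.0, scale=2.0, size=(10, 8))

# Fit on the first six training columns only.
scaler = StandardScaler().fit(x_train[:, 0:6])

# Work on copies so the original arrays are not overwritten.
x_train_scaled = x_train.copy()
x_test_scaled = x_test.copy()

# Both sets are transformed with the training mean and standard deviation.
x_train_scaled[:, 0:6] = scaler.transform(x_train[:, 0:6])
x_test_scaled[:, 0:6] = scaler.transform(x_test[:, 0:6])

print(x_train_scaled[:, 0:6].mean(axis=0).round(6))
```

After this, the first six training columns should have mean close to 0 and standard deviation close to 1, while the test columns will only be approximately standardized because they reuse the training statistics.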

Order of priors in sklearn LinearDiscriminantAnalysis

I'm fitting a Linear Discriminant Analysis model using the stock market data (Smarket.csv) from here. I'm trying to predict Direction with columns Lag1 and Lag2. Direction has two values: Up or Down.
Here is my reproducible code and the result:
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
url="https://raw.githubusercontent.com/JWarmenhoven/ISLR-python/master/Notebooks/Data/Smarket.csv"
Smarket=pd.read_csv(url, usecols=range(1,10), index_col=0, parse_dates=True)
X_train = Smarket[:'2004'][['Lag1', 'Lag2']]
y_train = Smarket[:'2004']['Direction']
LDA = LinearDiscriminantAnalysis()
model = LDA.fit(X_train, y_train)
print(model.priors_)
[0.49198397 0.50801603]
How do I know which prior value corresponds to which class (Up or Down)? I looked at the documentation but there seems to be nothing.
Can someone explain it to me or point me to a resource that explains this?
Although I cannot find an explicit reference in the documentation (I'm sure there is a general one somewhere), in such cases the classes are ordered alphabetically, i.e. in your case it is ['Down', 'Up'].
You can easily verify that this is consistent with your results here: when priors=None, as here, the priors_ attribute holds the class proportions inferred from the training data, according to the documentation:
y_train.value_counts(normalize=True)
gives:
Up 0.508016
Down 0.491984
Name: Direction, dtype: float64
and
model.priors_[0] == y_train.value_counts(normalize=True)['Down']
# True
model.priors_[1] == y_train.value_counts(normalize=True)['Up']
# True
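The ordering can also be read off the fitted model directly, since the classes_ attribute lists the labels in the same order as priors_. A small sketch on made-up data (the toy X and y below are assumptions, not the Smarket dataset):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Toy data: 6 "Up" rows and 4 "Down" rows with two random features.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
y = np.array(["Up"] * 6 + ["Down"] * 4)

lda = LinearDiscriminantAnalysis().fit(X, y)

# classes_ and priors_ share the same (sorted) order, so they can be zipped.
prior_by_class = dict(zip(lda.classes_, lda.priors_))
print(lda.classes_)
print(prior_by_class)
```

Pairing the two attributes this way avoids relying on the alphabetical-order convention by memory.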

Unable to Apply Scikit-Learn Imputer To A Dataset With Two Features

I am trying to read in the complete Titanic dataset, which can be found here:
biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.xls
import pandas as pd
# Importing the dataset
dataset = pd.read_excel('titanic3.xls')
y = dataset.iloc[:, 1].values
x = dataset.iloc[:, 2:14].values
# Create Dataset for Men
men_on_board = dataset[dataset['sex'] == 'male']
male_fatalities = men_on_board[men_on_board['survived'] ==0]
X_male = male_fatalities.iloc[:, [4,8]].values
# Taking care of missing data
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
imputer = imputer.fit(X_male[:,0])
X_male[:,0] = imputer.transform(X_male[:,0])
When I run all but the last line, I get the following warning:
/Users/<username>/anaconda/lib/python3.6/site-packages/sklearn/utils/validation.py:395: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
DeprecationWarning)
When I run the last line, it throws the following error:
File "<ipython-input-2-07afef05ee1c>", line 1, in <module>
X_male[:,0] = imputer.transform(X_male[:,0])
ValueError: could not broadcast input array from shape (523) into shape (682)
I've used the above code snippet for imputation on other projects, not sure why it's not working.
A quick solution is to change axis = 0 to axis = 1. This will make it work, though I'm not sure it's what you want, so here is an explanation of what happened:
The warning tells you that sklearn estimators now require 2D data arrays rather than 1D arrays, because it matters whether the data is interpreted as samples (rows) or as features (columns). During the deprecation period, this requirement is enforced by np.atleast_2d, which assumes your data is a single sample (row). Meanwhile, you passed axis = 0 to the Imputer, which imputes along columns with strategy = 'mean'. But you now have only one row, so when the Imputer comes across a missing value, there is nothing to average for that column. The entire column (which contains just this missing value) is therefore discarded. The result is equal to
X_male[:,0][~np.isnan(X_male[:,0])].reshape(1, -1)
That's why the assignment X_male[:,0] = imputer.transform(X_male[:,0]) failed: X_male[:,0] has shape (682) while imputer.transform(X_male[:,0]) has shape (523). My quick solution changes this to imputing along rows, where there is a mean to replace the missing values. Nothing is dropped this time, so imputer.transform(X_male[:,0]) has shape (682) and can be assigned to X_male[:,0].
I don't know why your snippet works on other projects. For your specific case, a (logically) better way to address the deprecation warning is to use X.reshape(-1, 1), since your data has a single feature and 682 samples. However, you then need to reshape the transformed data back before it can be assigned to X_male[:,0]:
imputer = imputer.fit(X_male[:,0].reshape(-1, 1))
X_male[:,0] = imputer.transform(X_male[:,0].reshape(-1, 1)).reshape(-1)
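For readers on a recent scikit-learn, where Imputer has been replaced by sklearn.impute.SimpleImputer, the same column-wise mean imputation might look like the sketch below. It assumes scikit-learn 0.20 or later, and the small array is made up to stand in for X_male.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up stand-in for X_male: column 0 has missing values, column 1 does not.
X_male = np.array([
    [22.0, 7.25],
    [np.nan, 8.05],
    [30.0, 13.0],
    [np.nan, 7.9],
])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# Indexing with [0] (a list) keeps the column 2D, shape (n_samples, 1),
# so no reshape is needed before or after the transform.
X_male[:, [0]] = imputer.fit_transform(X_male[:, [0]])

print(X_male[:, 0])
```

SimpleImputer always works column-wise, so there is no axis argument to get wrong.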

ML - Getting feature names after feature selection - SelectPercentile, python

I have been struggling with this one for a while.
My goal is to take a text feature that I have and find the best 5-10 words in it to help me classify. Hence, I am running a TfidfVectorizer and choosing the ~90 best for now. However, after I reduce the number of features, I am unable to see which features were actually chosen.
Here is what I have:
import pandas
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif
train=pandas.read_csv("train.tsv", sep='\t')
labels_train = train["label"]
documents = []
for i, row in train.iterrows():
    documents.append((row['boilerplate'][1:-1].lower()))
vectorizer = TfidfVectorizer(sublinear_tf=True, stop_words="english")
features_train_transformed = vectorizer.fit_transform(documents)
selector = SelectPercentile(f_classif, percentile=0.1)
selector.fit(features_train_transformed, labels_train)
features_train_transformed = selector.transform(features_train_transformed).toarray()
The result is that features_train_transformed contains a matrix of the tf-idf scores per word per document for the selected words. However, I have no idea which words were chosen, and methods like "get_feature_names()" are unavailable for the class SelectPercentile.
This is necessary because I need to add these features to a bunch of numeric features and only then do my training and predictions.
Use selector.get_support() to get a boolean array marking the columns that fall within the percentile range you specified.
train.columns.values should give you the complete list of column names of the original dataframe.
Filtering the latter with the former should give you the names of the columns that make up your chosen percentile range.
The code below (cut and pasted from working code) is similar enough to yours that it's hopefully helpful:
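import numpy as np
from sklearn.feature_selection import SelectPercentile, f_regression
# train is the full dataframe; y_train is the target (presumably the "y" column dropped below)
selection = SelectPercentile(f_regression, percentile=2)
train_minus_target = train.drop("y", axis=1)
x_features = selection.fit_transform(train_minus_target, y_train)
# boolean mask over the original columns, used to pick out the selected names
columns = np.asarray(train_minus_target.columns.values)
support = np.asarray(selection.get_support())
columns_with_support = columns[support]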
Reference:
about get_support
