How to use exponential smoothing to smooth the timeseries in python? - python

I am trying to use exponential smoothing to smooth a timeseries.
Suppose my timeseries looks like this:
import pandas as pd
data = [446.6565, 454.4733, 455.663 , 423.6322, 456.2713, 440.5881, 425.3325, 485.1494, 506.0482, 526.792 , 514.2689, 494.211 ]
index= pd.date_range(start='1996', end='2008', freq='A')
oildata = pd.Series(data, index)
I want to get the smoothed version of that timeseries.
If I do something like this:
from statsmodels.tsa.api import SimpleExpSmoothing
fit1 = SimpleExpSmoothing(oildata).fit(smoothing_level=0.2,optimized=False)
fcast1 = fit1.forecast(3).rename(r'$\alpha=0.2$')
it only outputs the three forecasted values, but not the smoothed version of my original timeseries. Is there a way to get the smoothed version of my original timeseries?
I am happy to provide more details if needed.

You can get the smoothed values from the fittedvalues attribute of the fitted results object (fit1 below), apparently.
import pandas as pd
data = [446.6565, 454.4733, 455.663 , 423.6322, 456.2713, 440.5881, 425.3325, 485.1494, 506.0482, 526.792 , 514.2689, 494.211 ]
index= pd.date_range(start='1996', end='2008', freq='A')
oildata = pd.Series(data, index)
from statsmodels.tsa.api import SimpleExpSmoothing
fit1 = SimpleExpSmoothing(oildata).fit(smoothing_level=0.2,optimized=False)
# fcast1 = fit1.forecast(3).rename(r'$\alpha=0.2$')
import matplotlib.pyplot as plt
plt.plot(oildata)
plt.plot(fit1.fittedvalues)
plt.show()
It yields a plot of the original series overlaid with the fitted values.
The documentation states:
fittedvalues: ndarray
An array of the fitted values. Fitted by the Exponential Smoothing model.
Note that you can also use the fittedfcast attribute which contains all values + the first forecast, or the fcastvalues attribute which contains the forecast only.
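If only a smoothed copy of the series is needed, the classical recursion s_t = alpha * x_t + (1 - alpha) * s_(t-1) can also be computed with pandas alone. The following is a minimal sketch rather than the statsmodels method: it assumes the recursion is seeded with the first observation, and it uses plain integer years as a stand-in for the original date index. Its values will generally differ from fit1.fittedvalues, which are in-sample one-step-ahead predictions and depend on how the initial level is chosen.

```python
import pandas as pd

data = [446.6565, 454.4733, 455.663, 423.6322, 456.2713, 440.5881,
        425.3325, 485.1494, 506.0482, 526.792, 514.2689, 494.211]
# Assumption: integer years stand in for the original date index.
oildata = pd.Series(data, index=range(1996, 2008))

alpha = 0.2
# adjust=False gives s_t = alpha * x_t + (1 - alpha) * s_(t-1), seeded with s_0 = x_0.
smoothed = oildata.ewm(alpha=alpha, adjust=False).mean()

# The same recursion written out by hand, for comparison.
manual = [data[0]]
for x in data[1:]:
    manual.append(alpha * x + (1 - alpha) * manual[-1])

print(smoothed.round(4))
```

A smaller alpha gives a smoother, slower-reacting curve; a larger alpha tracks the data more closely.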

ExponentialSmoothing is not a tool for smoothing time series data; it is a time series forecasting method.
The fit() function returns an instance of the HoltWintersResults class that contains the learned coefficients. The forecast() or predict() function can then be called on the result object to make a forecast.
So by calling predict, you get a forecast produced from the learned coefficients.
To smooth the time series, however, you can use the fittedvalues attribute, as @smarie points out.
However, I'd go with a more appropriate tool, such as a savgol_filter:
from scipy.signal import savgol_filter
savgol_filter(oildata, 5, 3)
array([444.87816 , 461.58666 , 444.99296 , 441.70785143,
442.40769143, 438.36852857, 441.50125714, 472.05622571,
512.20891429, 521.74822857, 517.63141429, 493.37037143])
As mentioned in the comments, the Savitzky-Golay filter fits a local polynomial of a given degree (polyorder) by least squares over a sliding window of a given size (window_length), which results in a smoothing of the time series.
Here's what it would look like with the above set up:
plt.plot(oildata)
plt.plot(pd.Series(savgol_filter(oildata, 5, 3), index=oildata.index))
plt.show()
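To get a feel for what window_length and polyorder do, here is a small sketch on synthetic data (the cubic below is made up for illustration, not taken from the question). Since the filter fits a polynomial of degree polyorder in each window, a series that is itself a cubic should pass through a polyorder=3 filter unchanged, while polyorder=1 flattens its curvature.

```python
import numpy as np
from scipy.signal import savgol_filter

x = np.arange(12, dtype=float)
cubic = 0.5 * x**3 - 2.0 * x**2 + x + 3.0  # exactly a degree-3 polynomial

# window_length=5, polyorder=3: a cubic is fitted in each 5-point window.
same = savgol_filter(cubic, 5, 3)
# polyorder=1: a straight line per window, so curvature is smoothed away.
flatter = savgol_filter(cubic, 5, 1)

print(np.allclose(same, cubic))     # does the cubic pass through unchanged?
print(np.allclose(flatter, cubic))  # does the linear fit reproduce it?
```

On real data, a lower polyorder or a wider window smooths more aggressively, at the cost of flattening genuine peaks.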

Related

SHAP plotting waterfall using an index value in dataframe

I am working on a binary classification problem using a random forest algorithm.
Currently, I am trying to explain the model predictions using SHAP values.
So, I referred to this useful post and tried the code below.
from shap import TreeExplainer, Explanation
from shap.plots import waterfall
sv = explainer(ord_test_t)
exp = Explanation(sv.values[:,:,1],
sv.base_values[:,1],
data=ord_test_t.values,
feature_names=ord_test_t.columns)
idx = 20
waterfall(exp[idx])
I like the above approach as it displays the feature values along with the waterfall plot, so I wish to use it.
However, this doesn't help me get the waterfall plot for a specific row in ord_test_t (test data).
For example, let's say that ord_test_t.index.tolist() returns 3, 5, 8, 9, etc.
Now, I want to plot the waterfall plot for the row of ord_test_t whose index label is 9, but when I pass exp[9], it just gets the row at position 9, not the row whose index is named 9.
When I try exp.iloc[[9]], it throws an error because the Explanation object doesn't have iloc.
Can you help me with this, please?
My suggestion is as follows:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from shap import TreeExplainer, Explanation
from shap.plots import waterfall
import shap
print(shap.__version__)
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
idx = 9
model = RandomForestClassifier(max_depth=5, n_estimators=100).fit(X, y)
explainer = TreeExplainer(model)
sv = explainer(X.loc[[idx]]) # corrected, pass the row of interest as df
exp = Explanation(
sv.values[:, :, 1], # class to explain
sv.base_values[:, 1],
data=X.loc[[idx]].values, # corrected, pass the row of interest as df
feature_names=X.columns,
)
waterfall(exp[0]) # pretend you have only 1 data point which is 0th
0.40.0
Proof:
model.predict_proba(X.loc[[idx]]) # corrected
array([[0.95752656, 0.04247344]])
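An alternative sketch, if the Explanation has already been computed for the whole test set: since exp[...] selects by position, the index label can be translated into a position first with Index.get_loc. The small DataFrame below is a hypothetical stand-in for ord_test_t, and the sketch assumes the index labels are unique (otherwise get_loc does not return a single integer).

```python
import pandas as pd

# Hypothetical stand-in for ord_test_t: the row labels are not 0..n-1.
ord_test_t = pd.DataFrame({"feature": [0.1, 0.2, 0.3, 0.4]}, index=[3, 5, 8, 9])

label = 9
pos = ord_test_t.index.get_loc(label)  # position of the row labelled 9

print(pos)
print(ord_test_t.iloc[pos].equals(ord_test_t.loc[label]))
# With an Explanation built over all of ord_test_t, one would then call waterfall(exp[pos]).
```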

how to use Copulae in Python

I'm using Copulae package, following the example of https://github.com/DanielBok/copulae.
My understanding is that the simulated values should have similar distribution as the input ones.
I would like to see the output dataframe of the fit (that is: the simulated data), and check its distribution. However, "fitted" produced below is not an array or df that I can open.
How can I extract the fitted data?
from copulae import NormalCopula
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(8)
data = np.random.normal(size=(300, 8))
plt.hist(data[:,1], bins=100) #checking input data histogram
cop = NormalCopula(8)
cop.fit(data) #fitting data with copula
fitted=cop.fit(data)

Get better fit on test data using Auto_Arima

I am using the AirPassengers dataset to predict a timeseries. For the model, I chose auto_arima to forecast the values. However, it seems that the order chosen by auto_arima is unable to fit the data well. The corresponding chart is produced by the code below.
What can I do to get a better fit?
My code for those that want to try:
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
%matplotlib inline
from pmdarima import auto_arima
df = pd.read_csv("https://raw.githubusercontent.com/AileenNielsen/TimeSeriesAnalysisWithPython/master/data/AirPassengers.csv")
df = df.rename(columns={"#Passengers":"Passengers"})
df.Month = pd.to_datetime(df.Month)
df.set_index('Month',inplace=True)
train,test=df[:-24],df[-24:]
model = auto_arima(train,trace=True,error_action='ignore', suppress_warnings=True)
model.fit(train)
forecast = model.predict(n_periods=24)
forecast = pd.DataFrame(forecast,index = test.index,columns=['Prediction'])
plt.plot(train, label='Train')
plt.plot(test, label='Valid')
plt.plot(forecast, label='Prediction')
plt.show()
from sklearn.metrics import mean_squared_error
print(mean_squared_error(test['Passengers'],forecast['Prediction']))
Thank you for reading. Any advice is appreciated.
This series is not stationary, and no amount of differencing will make it so (notice that the amplitude of the variations keeps increasing). However, transforming the data first by taking logs should do better (experiment shows that it does do better, but not what I would call well). Setting the seasonality (with m=12, as I suggest in the comment) and taking logs produces a fit that is essentially perfect.
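To illustrate why taking logs helps when the amplitude grows with the level, here is a small sketch on a synthetic series (made-up numbers, not the AirPassengers data). It assumes a multiplicative yearly cycle on top of a growing trend; on the log scale that cycle becomes additive, so the within-year swing stops growing.

```python
import numpy as np

t = np.arange(48)                                  # four years of monthly steps
trend = 100.0 * 1.02 ** t                          # synthetic growing level
seasonal = 1.0 + 0.2 * np.sin(2 * np.pi * t / 12)  # multiplicative yearly cycle
y = trend * seasonal

def yearly_swing(series):
    # Max minus min within each block of 12 months.
    by_year = series.reshape(-1, 12)
    return by_year.max(axis=1) - by_year.min(axis=1)

swing_raw = yearly_swing(y)          # swing of the raw series, per year
swing_log = yearly_swing(np.log(y))  # swing of the logged series, per year

print(swing_raw.round(2))
print(swing_log.round(4))

# A forecast made on the log scale is mapped back with np.exp.
restored = np.exp(np.log(y))
```

In practice this would mean fitting the model on np.log(train) and applying np.exp to the forecast before comparing it with the test data.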
The problem was that I did not specify m. In this case, I set m to 12, denoting a yearly cycle in monthly data, where each data row is a month. That's how I understand it.
Feel free to comment; I'm not entirely sure, as I am new to using ARIMA.
Code:
model = auto_arima(train,m=12,trace=True,error_action='ignore', suppress_warnings=True)
Just add m=12 to denote that the data is monthly.
Result:

Getting transformed X values from OLS model using statsmodels

I am trying to do a linear regression. With the results I want to multiply each x with its own estimated coefficient: xi·βi.
However, I am doing a lot of transformations on xi.
For example:
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
def log_plus_1(x):
    return np.log(x + 1.0)
df = sm.datasets.get_rdataset("Guerry", "HistData").data
df = df[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()
formule = 'Lottery ~ pow(Literacy,2) + log_plus_1(Wealth)'
mod = smf.ols(formula=formule, data=df)
res = mod.fit()
res.params
Now I need the values of pow(Literacy, 2) and log_plus_1(Wealth). Since they go into the model, I was hoping to get them out of it too, instead of transforming the data from the original dataset again.
In R I would use res$model to get it.
The data is stored as attributes of the model, e.g. the design matrix is mod.exog, the dependent or response variable is mod.endog.
(I'm not sure I remember the details of the following correctly: the data that patsy returns after creating the transformed design matrix should, in this case, be a pandas DataFrame, and should be stored in mod.data.orig_exog or something like that.)
res.predict automatically handles the transformation, i.e. patsy uses the formula information to transform the data for the explanatory variables in prediction in the same way as the data was transformed in creating the model.
predict only returns the prediction and not the internally transformed predict exog.
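As a sketch of how the pieces above fit together, the transformed design matrix can be wrapped in a DataFrame using mod.exog and mod.exog_names, and each column multiplied by its coefficient. The data below is synthetic, with hypothetical column names x1, x2 and y, rather than the Guerry dataset, and np.power and np.log1p stand in for pow and log_plus_1.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (hypothetical columns, not the Guerry dataset).
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.uniform(1, 10, 50), "x2": rng.uniform(1, 100, 50)})
df["y"] = 2.0 + 0.3 * df["x1"] ** 2 + 1.5 * np.log1p(df["x2"]) + rng.normal(0, 1, 50)

mod = smf.ols("y ~ np.power(x1, 2) + np.log1p(x2)", data=df)
res = mod.fit()

# Transformed design matrix, labelled with the model's own column names.
X = pd.DataFrame(mod.exog, columns=mod.exog_names, index=df.index)

# Each column times its coefficient; the multiplication aligns on column names.
contrib = X * res.params

print(contrib.head())
```

Summing contrib across columns should reproduce res.fittedvalues, which is a quick way to check that the right columns were picked up.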

Extract nominal and standard deviation from ufloat inside a panda dataframe

For convenience, I am using pandas DataFrames to perform uncertainty propagation on a large set of data.
I then wish to plot the nominal values of my data set, but something like myDF['colLabel'].n won't work. How can I extract the nominal values and standard deviations from a DataFrame in order to plot the nominal values with error bars?
Here is a MWE to be more concrete:
#%% MWE
import pandas as pd
from uncertainties import ufloat
import matplotlib.pyplot as plt
# building of a dataframe filled with ufloats
d = {'value1': [ufloat(1,.1),ufloat(3,.2),ufloat(5,.6),ufloat(8,.2)], 'value2': [ufloat(10,5),ufloat(50,2),ufloat(30,3),ufloat(5,1)]}
df = pd.DataFrame(data = d)
# plot of value2 vs. value1 with error bars.
plt.plot(x = df['value1'].n, y = df['value2'].n)
plt.errorbar(x = df['value1'].n, y = df['value2'].n, xerr = df['value1'].s, yerr = df['value2'].s)
# obviously .n and .s won't work.
I get the error AttributeError: 'Series' object has no attribute 'n', which suggests extracting the values from each series. Is there a shorter way to do it than looping through the data to separate the nominal and std values into two separate vectors?
Thanks.
EDIT: Using those functions from the package won't work either: uncertainties.nominal_value(df['value2']) and uncertainties.std_dev(df['value2'])
Actually, I solved it with the unumpy.nominal_values(arr) and unumpy.std_devs(arr) functions from uncertainties.
