The original ARMA algorithm has the following formula:
And here you can see that ARMA estimates p + q + 1 parameters, so there is no question about that; that's pretty clear.
But with the SARIMA algorithm I can't understand one thing. The SARIMA formula looks like ARMA with extras:
Here S is a number which stands for the seasonal period; S is a constant.
So SARIMA must estimate p + q + P + Q + 1 parameters, just P + Q extra. That's not too many if P = 1 and Q = 2.
But if we use a long period, for example 365 days for a daily time series, SARIMA just can't stop fitting. Look at these two models: the first one takes 9 seconds to fit, while the second one hadn't finished fitting after 2 hours!
import statsmodels.api as sm

model = sm.tsa.statespace.SARIMAX(
    df.meantemp_box,
    order=(1, 0, 2),
    seasonal_order=(1, 1, 1, 7)
).fit()

model = sm.tsa.statespace.SARIMAX(
    df.meantemp_box,
    order=(1, 0, 2),
    seasonal_order=(1, 1, 1, 365)
).fit()
And I can't understand that. Mathematically these models look the same - they both take the same p, q, P, and Q. But the second one either takes too long to fit, or is not able to fit at all.
Am I getting something wrong?
First, a possible solution: if you are using Statsmodels v0.11 or the development version, then you can use the following when you have long seasonal effects:
mod = sm.tsa.arima.ARIMA(endog, order=(1, 0, 2), seasonal_order=(1, 1, 1, 365))
res = mod.fit(method='innovations_mle', low_memory=True, cov_type='none')
The main restriction is that your time series cannot have missing entries. If you are missing values, then you will need to impute them somehow before creating the model.
Also, not all of our results features will be available to you, but you can still print the summary with the parameters, compute the loglikelihood and information criteria, compute in-sample predictions, and do out-of-sample forecasting.
Now to explain what the problem is:
The problem is that these models are estimated by putting them in state space form and then applying the Kalman filter to compute the log-likelihood. The dimension of the state space form of an ARIMA model grows quickly with the number of periods in a complete season - for your model with s=365, the dimension of the state vector is 733.
The Kalman filter requires multiplying matrices with this dimension, and by default, memory is allocated for matrices of this dimension for each of your time periods. That's why it takes forever to run (and it takes up a lot of memory too).
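To make that growth concrete, here is a hedged sketch of the state-dimension arithmetic. The function below is an illustrative reconstruction, not the library's own code; it is consistent with the 733 figure quoted above for order (1, 0, 2) and seasonal order (1, 1, 1, 365):

```python
# Illustrative reconstruction of how the SARIMAX state dimension
# grows with the seasonal period s (not statsmodels' internal code).
def sarimax_state_dim(p, d, q, P, D, Q, s):
    # companion-form order of the combined (non-seasonal x seasonal) ARMA polynomial
    k_order = max(p + s * P, q + s * Q + 1)
    # differencing is also carried in the state when simple_differencing=False
    return k_order + d + s * D

print(sarimax_state_dim(1, 0, 2, 1, 1, 1, 7))    # weekly seasonality: small state
print(sarimax_state_dim(1, 0, 2, 1, 1, 1, 365))  # yearly seasonality: 733
```

The point is that the dimension scales with s itself, not with P and Q, which is why the two models behave so differently despite having the same number of parameters.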
For the solution above, instead of computing the log-likelihood with the Kalman filter, we compute it using something called the innovations algorithm. We then run the Kalman filter only once, to compute the results object (this allows for e.g. forecasting). The low_memory=True option instructs the model not to store all of the large-dimensional matrices for each time step, and the cov_type='none' option instructs the model not to try to compute standard errors for the model's parameters (which would require many more log-likelihood evaluations).
Related
To my naked eye there are seasonal time series for which, when I use adfuller(), the results show the series is stationary based on the p-value.
I have also applied seasonal_decompose() to it. The results were pretty much what I expected:
tb3['percent'].plot(figsize=(18,8))
[figure: what the series looks like]
One thing to note is that my data is collected every minute.
tb3.index.freq = 'T'
from statsmodels.tsa.seasonal import seasonal_decompose
result = seasonal_decompose(tb3['percent'].values, freq=24*60, model='additive')
result.plot();
The result of the ETS decomposition is shown in the figure below.
[figure: ETS decomposition]
We can see a clear seasonality, which is the same as what I expected.
But when I use adfuller():
from statsmodels.tsa.stattools import adfuller
result = adfuller(tb3['percent'], autolag='AIC')
the p-value is less than 0.05, which means the series is stationary.
Can anyone tell me why that happens? How can I fix it?
I ask because I want to use a SARIMA model to predict future values, whereas an ARIMA model always predicts a constant value for the future.
An Augmented Dickey-Fuller test examines whether the coefficient c in the regression
y_t - y_{t-1} = <deterministic terms> + c y_{t-1} + <lagged differences>
is equal to 0 (equivalently, whether the autoregressive coefficient on y_{t-1} in levels is equal to 1). It does not have power against seasonal deterministic terms, and so it is not surprising that adfuller rejects the unit-root null and reports the series as stationary even though it is clearly seasonal: deterministic seasonality is not the kind of non-stationarity the test looks for.
You can use a stationary SARIMA model, for example
SARIMAX(y, order=(p,0,q), seasonal_order=(ps, 0, qs, 24*60))
where you set the AR, MA, seasonal AR, and seasonal MA orders as needed.
This model will be quite slow and memory-intensive, since you have 24 hours of minutely data and so a 1440-lag seasonal component.
The next version of statsmodels, which has been released as statsmodels 0.12.0rc0, adds initial support for deterministic processes in time series models which may simplify modeling this type of series. In particular, it would be tempting to use a low order Fourier deterministic sequence. Below is an example notebook.
https://www.statsmodels.org/devel/examples/notebooks/generated/deterministics.html
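As a rough illustration of the Fourier idea in plain NumPy (this is not the DeterministicProcess API; the function name here is made up): a handful of sin/cos columns at the seasonal period can stand in for a 1440-lag seasonal component.

```python
import numpy as np

# Low-order Fourier terms for a daily (1440-minute) cycle in minutely data.
def fourier_terms(n_obs, period, order):
    t = np.arange(n_obs)
    cols = []
    for k in range(1, order + 1):
        cols.append(np.sin(2 * np.pi * k * t / period))
        cols.append(np.cos(2 * np.pi * k * t / period))
    return np.column_stack(cols)

X = fourier_terms(n_obs=3 * 1440, period=1440, order=2)  # 3 days, 4 columns
# X could then be passed as `exog` to SARIMAX instead of a huge seasonal order.
```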
I am new to pymc3, but I've heard it can be used to build a Bayesian update model. So I tried, without success. My goal was to predict which day of the week a person buys a certain product, based on prior information from a number of customers, as well as that person's shopping history.
So let's suppose I know that customers in general buy this product only on Mondays, Tuesdays, Wednesdays, and Thursdays; and that the number of customers who bought the product in the past on those days is 3, 2, 1, and 1, respectively. I thought I would set up my model like this:
import numpy as np
import pymc3 as pm

dow = ['m', 'tu', 'w', 'th']
c = np.array([3, 2, 1, 1])

# hyperparameters (initially all equal)
alphas = np.array([1, 1, 1, 1])

with pm.Model() as model:
    # Parameters of the Multinomial are from a Dirichlet
    parameters = pm.Dirichlet('parameters', a=alphas, shape=4)
    # Observed data is from a Multinomial distribution
    observed_data = pm.Multinomial(
        'observed_data', n=7, p=parameters, shape=4, observed=c)
So this set up my model without any issues. Then I have an individual customer's data from 4 weeks: 1 means they bought the product, 0 means they didn't, for a given day of the week. I thought updating the model would be as simple as:
c = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 1]])

with pm.Model() as model:
    # Parameters are a Dirichlet distribution
    parameters = pm.Dirichlet('parameters', a=alphas, shape=4)
    # Observed data is a multinomial distribution
    observed_data = pm.Multinomial(
        'observed_data', n=1, p=parameters, shape=4, observed=c)
    trace = pm.sample(draws=100, chains=2, tune=50, discard_tuned_samples=True)
This didn't work.
My questions are:
Does this still take into account the priors I set up before, or does it create a brand-new model?
As written above, the code didn't work as it gave me a "bad initial energy" error. Through trial and error I found that parameter "n" has to be the sum of the elements in observations (so I can't have observations adding up to different n's). Why is that? Surely the situation I described above (where some weeks they shop only on Mondays, and others on Mondays and Thursday) is not impossible?
Is there a better way of using pymc3 or a different package for this type of problem? Thank you!
To answer your specific questions first:
The second model is a new model. You can reuse context managers by changing the line to just with model:, but looking at the code, that is probably not what you intended to do.
A multinomial distribution takes n draws, using the provided probabilities, and returns one list. pymc3 will broadcast for you if you provide an array for n. Here's a tidied version of your model:
with pm.Model() as model:
    parameters = pm.Dirichlet('parameters', a=alphas)
    observed_data = pm.Multinomial(
        'observed_data', n=c.sum(axis=-1), p=parameters, observed=c)
    trace = pm.sample()
You also ask about whether pymc3 is the right library for this question, which is great! The two models you wrote down are well known, and you can solve the posterior by hand, which is much faster: in the first model, it is a Dirichlet([4, 3, 2, 2]), and in the second Dirichlet([5, 2, 1, 2]). You can confirm this with PyMC3, or read up here.
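The hand computation above is just the conjugate Dirichlet-multinomial update (posterior concentration = prior concentration + observed counts), and can be checked in a couple of lines of NumPy:

```python
import numpy as np

alphas = np.array([1, 1, 1, 1])                  # uniform Dirichlet prior
counts_all = np.array([3, 2, 1, 1])              # aggregate customer counts
posterior_1 = alphas + counts_all                # Dirichlet([4, 3, 2, 2])

weeks = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [1, 0, 0, 0],
                  [1, 0, 0, 0], [1, 0, 0, 1]])   # the individual's weekly data
posterior_2 = alphas + weeks.sum(axis=0)         # Dirichlet([5, 2, 1, 2])
print(posterior_1, posterior_2)
```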
If you wanted to expand your model, or chose distributions that were not conjugate, then PyMC3 might be a better choice.
Suppose I have the following data:
array([[0.88574245, 0.3749999 , 0.39727183, 0.50534724],
[0.22034441, 0.81442653, 0.19313024, 0.47479565],
[0.46585887, 0.68170517, 0.85030437, 0.34167736],
[0.18960739, 0.25711086, 0.71884116, 0.38754042]])
and suppose I know that this data follows a normal distribution. How do I calculate the AIC number?
The formula is
2K - 2log(L)
K is the total number of parameters; for a normal distribution that is 3 (mean, variance, and residual). I'm stuck on L. L is supposed to be the maximum likelihood function, but I'm not sure what to pass in there for data that follows a normal distribution, and what about for Cauchy or exponential? Thank you.
Update: this question appeared in one of my coding interview.
For a given normal distribution, the probability density of y given a mean and standard deviation is:
import scipy.stats

def prob(y=0, mean=0, sd=1):
    return scipy.stats.norm(mean, sd).pdf(y)
For example, given mean = 0 and sd = 1, the probability of the value 0 is prob(0, 0, 1).
If we have the set of values 0 through 8, the log likelihood is the sum of the logs of these probabilities; in this case the best parameters are the mean and standard deviation of x, as in:
import numpy as np

x = range(9)
logLik = sum(np.log(prob(x, np.mean(x), np.std(x))))
Then AIC is simply:
K = 2
aic = 2*K - 2*logLik
For the data you provide, I am not so sure what the four columns and rows reflect. Do you have to calculate four means and four standard deviations? It's not very clear.
Hopefully the above can get you started.
I think the interview question leaves out some stuff, but maybe part of the point is to see how you handle that.
Anyway, AIC is essentially a penalized log likelihood calculation. Log likelihood is great -- the greater the log likelihood, the better the model fits the data. However, if you have enough free parameters, you can always make the log likelihood greater. Hmm. So various penalty terms, which counter the effect of more free parameters, have been proposed. AIC (Akaike Information Criterion) is one of them.
So the problem, as it is stated, is (1) find the log likelihood for each of the three models given (normal, exponential, and Cauchy), (2) count up the free parameters for each, and (3) calculate AIC from (1) and (2).
Now for (1) you need (1a) to look up or derive the maximum likelihood estimator for each model. For normal, it's just the sample mean and sample variance. I don't remember the others, but you can look them up, or work them out. Then (1b) you need to apply the estimators to the given data, and then (1c) calculate the likelihood, or equivalently, the log likelihood of the estimated parameters for the given data. The log likelihood of any parameter value is just sum(log(p(x|params))) where params = parameters as estimated by maximum likelihood.
As for (2), there are 2 parameters for a normal distribution, mu and sigma^2. For an exponential, there's 1 (it might be called lambda or theta or something). For a Cauchy, there might be a scale parameter and a location parameter. Or, maybe there are no free parameters (centered at zero and scale = 1). So in each case, K = 1 or 2 or maybe K = 0, 1, or 2.
Going back to (1b), the data look a little funny to me. I would expect a one dimensional list, but it seems like the array is two dimensional (with 4 rows and 4 columns if I counted right). One might need to go back and ask about that. If they really mean to have 4 dimensional data, then the conceptual basis remains the same, but the calculations are going to be a little more complex than in the 1-d case.
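Assuming the 4x4 array is simply 16 i.i.d. draws flattened to one dimension, steps (1)-(3) might look like the sketch below with scipy. The parameter counts passed as k follow the discussion above and are a judgment call; note that scipy's fit also estimates a location for the exponential.

```python
import numpy as np
from scipy import stats

data = np.array([[0.88574245, 0.3749999 , 0.39727183, 0.50534724],
                 [0.22034441, 0.81442653, 0.19313024, 0.47479565],
                 [0.46585887, 0.68170517, 0.85030437, 0.34167736],
                 [0.18960739, 0.25711086, 0.71884116, 0.38754042]]).ravel()

def aic(dist, data, k):
    params = dist.fit(data)                      # (1a)-(1b): MLE of the parameters
    loglik = np.sum(dist.logpdf(data, *params))  # (1c): log likelihood at the MLE
    return 2 * k - 2 * loglik                    # penalized log likelihood

print('normal :', aic(stats.norm, data, 2))      # mu, sigma
print('cauchy :', aic(stats.cauchy, data, 2))    # location, scale
print('expon  :', aic(stats.expon, data, 1))     # rate
```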
Good luck and have fun, it's a good problem.
I've set up the following binomial switchpoint model in PyMC3:
with pm.Model() as switchpoint_model:
    switchpoint = pm.DiscreteUniform('switchpoint', lower=df['covariate'].min(),
                                     upper=df['covariate'].max())
    # Priors for pre- and post-switch parameters
    early_rate = pm.Beta('early_rate', 1, 1)
    late_rate = pm.Beta('late_rate', 1, 1)
    # Allocate appropriate binomial probabilities to years before and after current
    p = pm.math.switch(switchpoint >= df['covariate'].values, early_rate, late_rate)
    p = pm.Deterministic('p', p)
    y = pm.Binomial('y', p=p, n=df['trials'].values, observed=df['successes'].values)
It seems to run fine, except that it entirely centers in on one value for the switchpoint (999), as shown below.
Upon further investigation it seems that the results for this model are highly dependent on the starting value (in PyMC3, "testval"). The below shows what happens when I set the testval = 750.
switchpoint = pm.DiscreteUniform('switchpoint', lower=gp['covariate'].min(),
                                 upper=gp['covariate'].max(), testval=750)
I get similarly different results with additional different starting values.
For context, this is what my dataset looks like:
My questions are:
Is my model somehow incorrectly specified?
If it's correctly specified, how should I interpret these results? In particular, how do I compare / select results generated by different testvals? The only idea I've had has been using WAIC to evaluate out of sample performance...
Models with discrete variables can be problematic: all the nice sampling techniques that use derivatives no longer work, and the posterior can behave a lot like a multimodal distribution. I don't really see why that would be so problematic in this case, but you could try using a continuous variable for the switchpoint instead (wouldn't that also make more sense conceptually?).
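One hedged way to do that, shown here in plain NumPy with illustrative names (in PyMC3 you would build the same expression with pm.math.sigmoid and a continuous prior on the switchpoint), is to replace the hard switch with a smooth sigmoid blend:

```python
import numpy as np

# A smooth stand-in for pm.math.switch: blend early_rate and late_rate with
# a sigmoid centred at the (now continuous) switchpoint. The sharpness
# constant controls how abrupt the transition is.
def smooth_switch(covariate, switchpoint, early_rate, late_rate, sharpness=0.01):
    w = 1.0 / (1.0 + np.exp(-sharpness * (covariate - switchpoint)))
    return (1 - w) * early_rate + w * late_rate

x = np.linspace(0, 2000, 5)
print(smooth_switch(x, switchpoint=1000.0, early_rate=0.2, late_rate=0.8))
```

Because the switchpoint is now continuous and the likelihood is differentiable in it, gradient-based samplers such as NUTS can move it, instead of the parameter getting stuck near its starting value.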
I'm using scikit-learn in Python to develop a classification algorithm to predict the gender of certain customers. Amongst others, I want to use the Naive Bayes classifier, but my problem is that I have a mix of categorical data (ex: "Registered online", "Accepts email notifications" etc.) and continuous data (ex: "Age", "Length of membership" etc.). I haven't used scikit much before, but I suppose that Gaussian Naive Bayes is suitable for continuous data and that Bernoulli Naive Bayes can be used for categorical data. However, since I want to have both categorical and continuous data in my model, I don't really know how to handle this. Any ideas would be much appreciated!
You have at least two options:
Transform all your data into a categorical representation by computing percentiles for each continuous variable and then binning the continuous variables using the percentiles as bin boundaries. For instance, for the height of a person create the following bins: "very small", "small", "regular", "big", "very big", ensuring that each bin contains approximately 20% of the population of your training set. We don't have any utility to perform this automatically in scikit-learn, but it should not be too complicated to do it yourself. Then fit a single multinomial NB on this categorical representation of your data.
Independently fit a Gaussian NB model on the continuous part of the data and a multinomial NB model on the categorical part. Then transform the whole dataset by taking the class assignment probabilities (with the predict_proba method) as new features: np.hstack((multinomial_probas, gaussian_probas)), and then refit a new model (e.g. a new Gaussian NB) on the new features.
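A minimal sketch of option 2 on made-up data (the names, shapes, and random data here are purely illustrative):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB

rng = np.random.RandomState(0)
X_cat = rng.randint(0, 2, size=(100, 2))   # two binary features
X_num = rng.randn(100, 2)                  # two continuous features
y = rng.randint(0, 2, size=100)

# Fit one NB model per feature type, then stack their class probabilities
# as new features for a second-stage classifier.
bnb = BernoulliNB().fit(X_cat, y)
gnb = GaussianNB().fit(X_num, y)
stacked = np.hstack((bnb.predict_proba(X_cat), gnb.predict_proba(X_num)))
second_stage = GaussianNB().fit(stacked, y)
print(second_stage.predict(stacked[:5]))
```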
Hope I'm not too late. I recently wrote a library called Mixed Naive Bayes, written in NumPy. It can assume a mix of Gaussian and categorical (multinoulli) distributions on the training data features.
https://github.com/remykarem/mixed-naive-bayes
The library is written such that the APIs are similar to scikit-learn's.
In the example below, let's assume that the first 2 features are from a categorical distribution and the last 2 are Gaussian. In the fit() method, just specify categorical_features=[0,1], indicating that Columns 0 and 1 are to follow categorical distribution.
from mixed_naive_bayes import MixedNB

X = [[0, 0, 180.9, 75.0],
     [1, 1, 165.2, 61.5],
     [2, 1, 166.3, 60.3],
     [1, 1, 173.0, 68.2],
     [0, 2, 178.4, 71.0]]
y = [0, 0, 1, 1, 0]

clf = MixedNB(categorical_features=[0, 1])
clf.fit(X, y)
clf.predict(X)
Pip installable via pip install mixed-naive-bayes. More information on the usage in the README.md file. Pull requests are greatly appreciated :)
The simple answer: multiply the results! It's the same.
Naive Bayes is based on applying Bayes' theorem with the "naive" assumption of independence between every pair of features, meaning you calculate the Bayes probability for each feature on its own, without conditioning on the others. The algorithm therefore multiplies each probability from one feature with the probability from the next (and we ignore the denominator entirely, since it is just a normalizer).
So the right answer is:
1. Calculate the probability from the categorical variables.
2. Calculate the probability from the continuous variables.
3. Multiply 1. and 2.
#Yaron's approach needs an extra step (4. below):
1. Calculate the probability from the categorical variables.
2. Calculate the probability from the continuous variables.
3. Multiply 1. and 2.
AND
4. Divide 3. by the sum of the products of 1. and 2. EDIT: What I actually mean is that the denominator should be (probability of the evidence given the hypothesis is yes) + (probability of the evidence given the hypothesis is no) (assuming a binary problem, without loss of generality). That way, the probabilities of the hypotheses (yes or no) given the evidence sum to 1.
Step 4. is the normalization step. Take a look at #remykarem's mixed-naive-bayes as an example (lines 268-278):
if self.gaussian_features.size != 0 and self.categorical_features.size != 0:
    finals = t * p * self.priors
elif self.gaussian_features.size != 0:
    finals = t * self.priors
elif self.categorical_features.size != 0:
    finals = p * self.priors

normalised = finals.T / (np.sum(finals, axis=1) + 1e-6)
normalised = np.moveaxis(normalised, [0, 1], [1, 0])
return normalised
The probabilities of the Gaussian and Categorical models (t and p respectively) are multiplied together in line 269 (line 2 in extract above) and then normalized as in 4. in line 275 (fourth line from the bottom in extract above).
For hybrid features, you can check this implementation.
The author has presented mathematical justification in his Quora answer, you might want to check.
You will need the following steps:
1. Calculate the probability from the categorical variables (using the predict_proba method of BernoulliNB).
2. Calculate the probability from the continuous variables (using the predict_proba method of GaussianNB).
3. Multiply 1. and 2. AND
4. Divide by the prior (either from BernoulliNB or from GaussianNB, since they are the same) AND THEN
5. Divide 4. by the sum (over the classes) of 4. This is the normalisation step.
It should be easy enough to see how you can add your own prior instead of using those learned from the data.
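A hedged sketch of those five steps on made-up data (the random data and names are illustrative; the key operations are the column-wise division by class_prior_ and the row normalisation):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB

rng = np.random.RandomState(1)
X_cat = rng.randint(0, 2, size=(60, 3))    # categorical (binary) features
X_num = rng.randn(60, 2)                   # continuous features
y = rng.randint(0, 2, size=60)

bnb = BernoulliNB().fit(X_cat, y)
gnb = GaussianNB().fit(X_num, y)

joint = bnb.predict_proba(X_cat) * gnb.predict_proba(X_num)  # steps 1-3
joint = joint / gnb.class_prior_             # step 4: the prior appears twice, divide once out
posterior = joint / joint.sum(axis=1, keepdims=True)  # step 5: normalise over classes
print(posterior[:3])
```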