Python - How to Select Certain Lags in ARIMA Model?

I want to fit model = ARIMA(ret_log, order=(5,0,0)), but with the second and third lags in the AR part set to zero because their autocorrelations are not significant. How can I do this in Python? I know it is easily doable in R.
I've seen similar questions asked for R, but only one such question asked for Python (link here). However, the answer does not seem to work, nor do I think the person who asked was satisfied.
I tried tsa.arima.model.ARIMA.fix_params and tsa.arima.model.ARIMA.fit_constrained, but both threw an AttributeError, such as 'ARMA' object has no attribute 'fit_constrained'.
Does anyone have any idea? Thanks.

As mentioned in the statsmodels.tsa.arima.model.ARIMA documentation,
p and q may either be integers or lists of integers
so you can simply pass the lags to include as a list.
Please note that I have not tried it and cannot guarantee that it will work.
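For example, a minimal sketch of what that would look like (untested here; it assumes ret_log is a pandas Series and a recent statsmodels version, and the fit_constrained route from the question is sketched as well since it exists on statsmodels.tsa.arima.model.ARIMA, unlike the older tsa.arima_model classes):
# Sketch only, not verified against the asker's data.
from statsmodels.tsa.arima.model import ARIMA

# Option 1: pass the AR lags to keep as a list, dropping lags 2 and 3.
model = ARIMA(ret_log, order=([1, 4, 5], 0, 0))
result = model.fit()
print(result.summary())   # should show ar.L1, ar.L4, ar.L5 only

# Option 2: keep order=(5, 0, 0) but constrain the unwanted coefficients to zero.
model2 = ARIMA(ret_log, order=(5, 0, 0))
result2 = model2.fit_constrained({"ar.L2": 0, "ar.L3": 0})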

Related

CLARANS pyclustering implementation - is there a mistake in the code?

I know my question is a bit weird, but I am trying to implement my own version of the CLARANS algorithm for the sake of learning. To understand it better, I tried to go through the CLARANS code of the pyclustering library (which is somewhat outdated but still seems popular, and I've seen it used in some places). Here it is:
https://github.com/annoviko/pyclustering/blob/master/pyclustering/cluster/clarans.py
I understood everything (or thought so) until line 210, so just before the cost calculation takes place.
distance_nearest = float('inf')
if ( (point_medoid_index != candidate_medoid_index) and (point_medoid_index != current_medoid_cluster_index) ):
    distance_nearest = euclidean_distance_square(self.__pointer_data[point_index], self.__pointer_data[point_medoid_index])
Is there a bug inside the library? Let's say we have 1000 data points and 3 clusters (so 3 medoids, of course).
Why would we set distance_nearest = float('inf') for any of the points (especially since we add distance_nearest later in the code)? And what is more, why would we compare the index of the analyzed point's medoid (which could be, say, 400) to current_medoid_cluster_index (which can only take values from 0 to 2)? What's the point of that?
I'm sorry if this is a bit chaotic, but I'm honestly looking for someone who is either interested in going through the code or who already understands it - I am willing to elaborate further if needed.
If someone understands it and knows there is no bug, could you please explain the cost calculation part?

How to find the reverse of power_mod(a,b,c)

When using power_mod(a,b,c) we get a^b % c, returning x. I have a, c, and x, but I am having a difficult time reversing the % c.
Is there a function that already exists to reverse this, or would I need to implement the Euclidean algorithm to find what b is and return that?
This sounds like the "discrete logarithm" problem. If not, please be clearer (e.g., give a specific numeric example). But if so, as the cited article says, no efficient approach is known for the general case.
Since you tagged "sage" in the question, see the docs for Sage's discrete_log() function too. It implements some of the approaches named in the Wikipedia article cited earlier (Pohlig-Hellman and baby-step giant-step).
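For moduli small enough that those generic algorithms finish, here is a minimal Sage sketch (the numbers are made up purely for illustration):
# Sage sketch with illustrative numbers; recovers b from x = power_mod(a, b, c).
a, c = 3, 101                      # base and (prime) modulus
x = power_mod(a, 20, c)            # pretend only a, c and x are known
b = discrete_log(Mod(x, c), Mod(a, c))
print(b)                           # 20 in this example (b is only determined up to
                                   # the multiplicative order of a modulo c)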

What should I do if there are too many zero values in the outlier handling part?

I am working on a data science project about churn analysis (whether a customer is leaving or not). I am doing the outlier-handling part, but I have a question about how to think about it when my data has many zero values. I know the zeros may carry meaning, but please see the results below.
(Attached: results, value counts, z-score hard edges, and the detected outliers.)
I would like to ask what I should do for better results, and whether I should keep all the zero values. Any suggestions?
This question is too broad to be asked here. Stack Overflow is mainly for programming questions; I recommend posting it on the Stats or Data Science Stack Exchange sites, where it has more potential to be answered in a broader way.
I guess the 0 values are not missing, as @yatu suspected; inferred from the column name, they mean no change in revenue. Moreover, 0s are not outlier values.
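To make that concrete, here is a small made-up sketch (the column name and numbers are illustrative, not taken from the question's data) showing that the zeros sit near the centre of the distribution and are not flagged by a simple z-score rule:
# Illustrative only: a column with many zeros plus some spread-out values.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"revenue_change": np.r_[np.zeros(500), rng.normal(0, 10, 500)]})

col = df["revenue_change"]
z = (col - col.mean()) / col.std()                     # classic z-score
print("number of zeros:", (col == 0).sum())
print("zeros flagged as outliers (|z| > 3):", ((col == 0) & (z.abs() > 3)).sum())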
Refer to this similar discussion.
I can suggest one more reading; the paper conveys the intuition even though it does not discuss your problem explicitly. You may still find it useful. Of course, do not forget to search for references yourself.
Further reading: A Statistical Model for Big Data with Excessive Zero-Inflated Problem

In sklearn.linear_model.Ridge, what exactly is the solver parameter doing?

In the sklearn.linear_model.Ridge method, there is a parameter solver : {‘auto’, ‘svd’, ‘cholesky’, ‘lsqr’, ‘sparse_cg’, ‘sag’, ‘saga’}.
According to the documentation, we should choose a different solver depending on whether the data is dense or sparse, or just use 'auto'.
So, in my opinion, we just choose a specific solver to make the computation fast for the corresponding data.
Are my thoughts right or wrong?
If you don't mind, could anyone give me some advice? I haven't found anything that confirms or refutes my thoughts.
Thanks.
You are almost right.
Some solvers work only with a specific type of data (dense vs. sparse) or a specific type of problem (e.g., non-negative weights).
However, for many cases you can use multiple solvers (e.g., for sparse problems you have at least sag, sparse_cg and lsqr). These solvers have different characteristics: some work better in some cases and others work better in other cases. In some cases a solver may not even converge.
In many cases, the simple engineering answer is to use the solver that works best on your data: just test all of them and measure the time, as in the sketch below.
If you want a more precise answer, you should dig into the documentation of the referenced methods (e.g. scipy.sparse.linalg.lsqr).
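A minimal sketch of that benchmarking idea (synthetic data with made-up sizes; on your own X and y the ranking can differ):
# Time each Ridge solver on the same synthetic problem.
import time
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 50))
y = X @ rng.standard_normal(50) + 0.1 * rng.standard_normal(5000)

for solver in ["svd", "cholesky", "lsqr", "sparse_cg", "sag", "saga"]:
    model = Ridge(alpha=1.0, solver=solver, max_iter=10000)
    start = time.perf_counter()
    model.fit(X, y)
    print(f"{solver:10s} fit in {time.perf_counter() - start:.4f} s")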

How do different methods of getting spectra in Python actually work?

I have several signals as columns in a pandas DataFrame (each of them has some NaNs at the beginning or the end, because they all cover slightly different intervals). Each signal has some sort of trend (basically a long-wavelength portion with values on the order of X00) and some small wiggles (values on the order of X-X0). I would like to compute a spectrum of each of these columns, and I expect to see two peaks on it - one in the X-X0 range and the other one around X00 (which suggests that I should work on a log scale).
However, the "spectra" that I produced using several different methods (scipy.signal.welch and numpy.fft.fft) do not look like the expected output (the peaks are always at 20 and 40).
Here are several aspects that I don't understand:
Is there any time-series processing built in somewhere deep inside these functions, so that they actually don't work if I use wavelengths instead of periods/frequencies?
I found the documentation rather confusing and not very helpful, in particular when it comes to the input parameters and the output. Do I include the signal as it is, or do I need to do some sort of pre-processing before? Should I then include the sampling frequency (i.e. 1/(wavelength sampling interval), which in my case would be, let's say, 1/0.01 per m) or the sampling interval (i.e. 0.01 m)? Does the output show the same unit or 1/the unit? (I tried all combinations of unit and 1/unit and none of them yielded a reasonable result, so there is another problem here too, but I am still uncertain about this.)
Should I use yet another method, or are these not suitable?
I am not even sure if these are the right questions to ask, but I am afraid that if I knew what question to ask, I would know the answer.
Disclaimer: I am not proficient at signal processing, so I am not actually sure whether my issue is really with Python or with a deeper understanding of the problem.
Even if I try a very simple example, I don't understand the behaviour:
import numpy as np
import scipy as sp
import scipy.signal   # so that sp.signal.welch works
x = np.arange(0, 10, 0.01)
x = x * np.pi
y = np.sin(0.2*x) + np.cos(3*x)
freq, spec = sp.signal.welch(y, fs=(1/(0.01*np.pi)))  # pi -> np.pi
I would expect to see two peaks in the spectrum, one at ~15 and another one at ~2. Or, if it is still in frequency, then at ~1/15 and ~1/2. But this is what I get in the first case, and this is what I get if I plot 1/freq instead of freq - the 15 is even out of range! So I don't know what I am actually plotting.
Thanks a lot.
