I'm rather new at programming in general, so do forgive me if the question is rather basic.
I'm trying to determine my p, d, q values for an ARIMA model and I've already conducted an adfuller test that determined that my time series is stationary. However, when I plot out my ACF and PACF plots, I get the following:
ACF plot
PACF plot
From what I've read about the p values, I'm supposed to pick the value where the line first crosses the confidence interval, but I'm not sure why my confidence intervals for both plots are that small. Does this mean that my MA value should be 2 according to the PACF plot? Any help in interpreting the graphs would be appreciated!
My code:
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(20, 12))
plot_acf(train_set.dropna(), lags=10, ax=ax1)
plot_pacf(train_set.dropna(), lags=10, ax=ax2)
plt.show()
The d component is used to make the data stationary by differencing. If the ADF test (and the KPSS test) shows that the data is already stationary, you can probably set it to 0. However, keep in mind that you cannot trust these tests 100%.
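For reference, a quick sketch of running both tests with statsmodels (assuming train_set is the series from your question):

from statsmodels.tsa.stattools import adfuller, kpss

series = train_set.dropna()
adf_p = adfuller(series)[1]                # H0: unit root (non-stationary); small p suggests stationarity
kpss_p = kpss(series, regression='c')[1]   # H0: stationary; small p suggests non-stationarity
print(f"ADF p-value: {adf_p:.3f}, KPSS p-value: {kpss_p:.3f}")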
The confidence interval indicates whether a correlation is statistically significant, i.e. very likely not due to chance. All bars that extend beyond the confidence band are "real" correlations that you can use for modeling.
There are countless rules of thumb for interpreting these plots. I recommend the following:
If the ACF trails off, use an AR model, taking the order from the significant and strong lags in the PACF.
If the PACF trails off, use an MA model, taking the order from the significant and strong lags in the ACF.
You can also have a look here:
https://towardsdatascience.com/identifying-ar-and-ma-terms-using-acf-and-pacf-plots-in-time-series-forecasting-ccb9fd073db8
I guess you created the plots with statsmodels. In that case, keep in mind that lag 0 (the first bar in each plot) is the correlation of the time series with itself, so it will always be +1 and significant; you can ignore it. In your case, the ACF is trailing off, and the PACF has only one statistically significant and strong correlation at the first lag; perhaps you can also use lags 2, 3 and 4, but they are very weak. Best is of course if you just try it out. Or you can use pmdarima's auto_arima() function:
https://alkaline-ml.com/pmdarima/tips_and_tricks.html
https://alkaline-ml.com/pmdarima/modules/generated/pmdarima.arima.auto_arima.html
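For example, here is a minimal sketch of both routes, assuming train_set is the stationary series from the question and that the PACF suggests p=1 (treat the order as a guess to be validated, not a definitive choice):

from statsmodels.tsa.arima.model import ARIMA
import pmdarima as pm

series = train_set.dropna()

# Manual choice: p=1 from the PACF cutoff, d=0 (series already stationary), q=0
manual_fit = ARIMA(series, order=(1, 0, 0)).fit()
print(manual_fit.summary())

# Let pmdarima search over (p, d, q) and compare with the manual choice
auto_fit = pm.auto_arima(series, seasonal=False, stepwise=True, trace=True)
print(auto_fit.order)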
I am getting a negative confidence interval for a linear regression plot even though all data points are positive. Why is this happening? Will this negative confidence interval also affect my R² score?
Code used is:
import seaborn as sns
import matplotlib.pyplot as plt

sns.regplot(x='Consumer Confidence Index_1', y='Sales (ALV sources)', data=df_mx2)
plt.show()
See graph pic here
One of the foundational assumptions of linear regression is that the residuals are normally distributed about the line. In your case you have data on the right side and on the left side with a big gap in the middle. As such, you should double-check that a linear regression is appropriate for your analysis.
That being said, rest easy: the negative confidence interval will NOT affect your R² value.
The reason for the negative confidence interval is the sparsity of data with x < 42. If the three points on the right side were removed, the regression would have a positive slope intersecting the x axis around x = 42; extended down to x = 30 or so, that line would be very negative. So, to reach the confidence level you have set, the interval has to be very wide in that region to include the data that potentially lines up with this steeper regression line.
This can be interpreted as: the data provides very little predictive ability below x = 42.
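If you want to double-check the R² point above, here is a quick sketch (assuming df_mx2 and the column names from your snippet) that computes R² directly from the linear fit, independently of the plotted band:

from scipy import stats

x = df_mx2['Consumer Confidence Index_1']
y = df_mx2['Sales (ALV sources)']
res = stats.linregress(x, y)
print(res.rvalue ** 2)   # R^2 of the fitted line; the confidence band does not enter into it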
What does the background in blue mean or determine in the regression plot when using seaborn? What determines its width at both ends?
According to the seaborn documentation, that area represents the confidence interval. You can set it through the ci parameter:
Size of the confidence interval for the regression estimate. This will be drawn using translucent bands around the regression line. The confidence interval is estimated using a bootstrap; for large datasets, it may be advisable to avoid that computation by setting this parameter to None.
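As an illustration (a small sketch using seaborn's built-in tips dataset), you can widen, narrow, or switch off the band via ci:

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")   # example dataset shipped with seaborn

fig, axes = plt.subplots(1, 3, figsize=(15, 4), sharey=True)
sns.regplot(x="total_bill", y="tip", data=tips, ci=95, ax=axes[0])    # default 95% band
sns.regplot(x="total_bill", y="tip", data=tips, ci=68, ax=axes[1])    # narrower band
sns.regplot(x="total_bill", y="tip", data=tips, ci=None, ax=axes[2])  # no band at all
plt.show()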
For the statistical meaning of a confidence interval, I suggest the Wikipedia definition:
The confidence interval represents values for the population parameter for which the difference between the parameter and the observed estimate is not statistically significant at the 10% level.
Seaborn uses 95% as the default level. Strictly speaking, the band describes the uncertainty of the estimated regression line: if you repeatedly resampled the data and refit the model, about 95% of the bands constructed this way would contain the true regression line. It is not the region where 95% of new observations fall (that would be a prediction interval). In practice, a wider band simply means the estimate of the line is less certain because of the dispersion of the data.
I plotted a scatter plot on my dataframe which looks like this:
with code
from scipy import stats
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('/content/drive/My Drive/df.csv', sep=',')
subset = df.iloc[:, 1:10080]          # select the value columns by position
df['mean'] = subset.mean(axis=1)      # per-row average
df.plot(x='mean', y='Result', kind='scatter')
sns.lmplot(x='mean', y='Result', data=df, order=1)
plt.show()
I wanted to find the slope of the regression in the graph using this code:
scipy.stats.mstats.linregress(Result,average)
but from the output it seems like the slope magnitude is too small:
LinregressResult(slope=-0.0001320534706614152, intercept=27.887336813241845, rvalue=-0.16776138446214162, pvalue=3.0450456899520655e-07, stderr=2.55977061451773e-05)
if I switched the Result and average positions,
scipy.stats.mstats.linregress(average,Result)
it still doesn't look right as the intercept is too large
LinregressResult(slope=-213.12489536011773, intercept=7138.48783135982, rvalue=-0.16776138446214162, pvalue=3.0450456899520655e-07, stderr=41.31287437069993)
Why is this happening? Do these output values need to be rescaled?
The signature for scipy.stats.mstats.linregress is linregress(x,y) so your second ordering, linregress(average, Result) is the one that is consistent with the way your graph is drawn. And on that graph, an intercept of 7138 doesn't seem unreasonable—are you getting confused by the fact that the x-axis limits you're showing don't go down to 0, where the intercept would actually happen?
In any case, your data really don't look like they follow a linear law, so the slope (or any parameter from a completely misspecified model) will not actually tell you much. Are the x and y values all strictly positive? And is there a particular reason why x can never logically go below 25? The data points certainly seem to be piling up against that vertical asymptote. If so, I would probably subtract 25 from x and then fit a linear model to logged data; in other words, do your plot and your linregress with x=numpy.log(average-25) and y=numpy.log(Result).
EDIT: since you say x is temperature, there is no logical reason why x can't go below 25 (it is meaningful to want to extrapolate below 25, for example, and even below 0). Therefore don't subtract 25, and don't log x. Just log y.
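For example, a sketch of the suggested transform (assuming average and Result are the same arrays you passed to linregress, and that all Result values are strictly positive):

import numpy as np
from scipy import stats

# Log the response only (per the edit above: x is temperature, so no shift or log on x)
x = average
y = np.log(Result)        # requires Result > 0

res = stats.mstats.linregress(x, y)
print(res.slope, res.intercept, res.rvalue)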
In your comments you talk about rescaling the slope, and eventually the suspicion emerges that you think this will give you a correlation coefficient. These are different things: the correlation coefficient depends on the spread of the points around the line as well as on the slope. If what you want is correlation, look up the relevant tools using that keyword.
I am trying to implement a machine-learning algorithm to predict house prices in New York City.
Now, when I try to plot (using Seaborn) the relationship between two columns of my house-prices dataset, 'gross_sqft_thousands' (the gross area of the property in thousands of square feet) and the target column 'sale_price_millions', I get a weird plot like this one:
Code used to plot:
sns.regplot(x="sale_price_millions", y="gross_sqft_thousands", data=clean_df);
When I try to plot the number of commercial units (commercial_units column) versus the sale_price_millions, I get also a weird plot like this one:
I get these weird plots even though, according to the correlation matrix, sale_price correlates very well with both variables (gross_sqft_thousands and commercial_units).
What am I doing wrong, and what should I do to get a good plot, with fewer points and a clear fit like this plot:
Here is a part of my dataset:
Your housing price dataset is much larger than the tips dataset shown in that Seaborn example plot, so scatter plots made with default settings will be massively overcrowded.
The second plot looks "weird" because it plots a (practically) continuous variable, sales price, against an integer-valued variable, total_units.
The following solutions come to mind:
Downsample the dataset with something like sns.regplot(x="sale_price_millions", y="gross_sqft_thousands", data=clean_df[::10]). The [::10] part selects every 10th row of clean_df. You could also try clean_df.sample(frac=0.1, random_state=12345), which randomly samples 10% of all rows without replacement (using a fixed random seed for reproducibility).
Reduce the alpha (opacity) and/or size of the scatterplot points with sns.regplot(x="sale_price_millions", y="gross_sqft_thousands", data=clean_df, scatter_kws={"alpha": 0.1, "s": 1}).
For plot 2, add a bit of "jitter" (random noise) to the y-axis variable with sns.regplot(..., y_jitter=0.05).
For more, check out the Seaborn documentation on regplot: https://seaborn.pydata.org/generated/seaborn.regplot.html
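Putting these together, a minimal sketch (assuming clean_df is your DataFrame with the columns from your snippet):

import seaborn as sns
import matplotlib.pyplot as plt

# Random 10% sample, faint small points, and a little vertical jitter
sample = clean_df.sample(frac=0.1, random_state=12345)
sns.regplot(
    x="sale_price_millions",
    y="gross_sqft_thousands",
    data=sample,
    scatter_kws={"alpha": 0.1, "s": 1},
    y_jitter=0.05,
)
plt.show()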
I have used the Elo and Glicko rating systems along with the results for matches to generate ratings for players. Prior to each match, I can generate an expectation (a float between 0 and 1) for each player based on their respective ratings. I would like test how accurate this expectation is, for two reasons:
To compare the different rating systems
To tune variables (such as kfactor in Elo) used to calculate ratings
There are a few differences from chess worth being aware of:
Possible results are wins (which I am treating as 1.0), losses (0.0), and the very occasional (<5%) draw (0.5 each). Each individual match is rated, not a series as in chess.
Players have fewer matches -- many have fewer than 10, few go over 25, and the maximum is 75.
Thinking the appropriate function is "correlation", I have attempted to create a DataFrame containing the prediction in one column (a float between 0 and 1) and the result in the other (1, 0.5, or 0) and to use corr(), but based on the output, I am not sure if this is correct.
If I create a DataFrame containing expectations and results for only the first player in each match (the results will always be 1.0 or 0.5 since, due to my data source, losers are never displayed first), corr() returns a very low value: < 0.05. However, if I instead include two rows per match, with the expectation and result for each player (or, alternatively, randomly choose which player to append, so results can be 0, 0.5, or 1), corr() is much higher: ~0.15 to 0.30. I don't understand why this would make a difference, which makes me wonder if I am either misusing the function or using the wrong function entirely.
If it helps, here is some real (not random) sample data: http://pastebin.com/eUzAdNij
An industry-standard way to judge the accuracy of predictions is the Receiver Operating Characteristic (ROC) curve. You can create it from your data using sklearn and matplotlib with the code below.
ROC is a 2-D plot of the true positive rate vs the false positive rate. You want the curve to lie above the diagonal: the higher, the better. The Area Under the Curve (AUC) is a standard measure of accuracy: the larger it is, the more accurate your classifier.
import pandas as pd
# read data
df = pd.read_csv('sample_data.csv', header=None, names=['classifier','category'])
# remove values that are not 0 or 1 (two of those)
df = df.loc[(df.category==1.0) | (df.category==0.0),:]
# examine data frame
df.head()
from matplotlib import pyplot as plt
# add this magic if you're in a notebook
# %matplotlib inline
from sklearn.metrics import roc_curve, auc
# matplot figure
figure, ax1 = plt.subplots(figsize=(8,8))
# create ROC itself
fpr,tpr,_ = roc_curve(df.category,df.classifier)
# compute AUC
roc_auc = auc(fpr,tpr)
# plotting bells and whistles
ax1.plot(fpr,tpr, label='%s (area = %0.2f)' % ('Classifier',roc_auc))
ax1.plot([0, 1], [0, 1], 'k--')
ax1.set_xlim([0.0, 1.0])
ax1.set_ylim([0.0, 1.0])
ax1.set_xlabel('False Positive Rate', fontsize=18)
ax1.set_ylabel('True Positive Rate', fontsize=18)
ax1.set_title("Receiver Operating Characteristic", fontsize=18)
plt.tick_params(axis='both', labelsize=18)
ax1.legend(loc="lower right", fontsize=14)
plt.grid(True)
plt.show()
From your data, you should get a plot like this one:
Actually, what you observe makes perfect sense. If there were no draws and you always recorded only the first player (the winner), then there would be no correlation at all: no matter how big or small the expectation, the value in the result column is always 1.0, i.e. it does not depend on the expectation at all.
Because of the low percentage of draws (which probably correspond to expectations around 0.5), you can still observe a small correlation.
Maybe the correlation is not the best measure for the accuracy of the predictions here.
One of the problems is that Elo does not predict a single result but the expected number of points. There is at least one unknown factor: the probability of a draw. You have to put additional knowledge about the probability of a draw into your models. This probability depends on the strength difference between the players: the bigger the difference, the smaller the chance of a draw. One could try the following approaches:
Map expected points onto expected results, e.g. 0.0-0.4 means a loss, 0.4-0.6 a draw, and 0.6-1.0 a win, and count how many results are predicted correctly (see the sketch after this list).
For a player and a set of games, the measure of accuracy would be |predicted_score - score| / number_of_games, averaged over the players. The smaller the difference, the better.
A kind of Bayesian approach: if for a game the predicted number of points is x, then the score of the predictor is x if the game was won and 1 - x if it was lost (you may have to skip the draws, or score them as 4*x*(1-x), so that a prediction of 0.5 would score 1). The overall score of the predictor over all games would be the product of the single-game scores. The bigger the score, the better.
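A small sketch of the first two ideas (the column names 'expected' and 'result' are hypothetical; adapt them to your DataFrame of expectations and results):

import numpy as np

# Hypothetical columns: 'expected' is the pre-match expectation, 'result' is 1.0 / 0.5 / 0.0
exp = df['expected'].to_numpy()
res = df['result'].to_numpy()

# 1) Map expectations onto discrete predicted results and count exact hits
predicted = np.where(exp < 0.4, 0.0, np.where(exp < 0.6, 0.5, 1.0))
hit_rate = (predicted == res).mean()

# 2) Mean absolute difference between expectation and actual score (smaller is better)
mae = np.abs(exp - res).mean()
print(hit_rate, mae)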